Chapter 1


Introduction 


1.1  Motivation 

It  is  widely  believed  that  parallel  computation  will  be  the  basis  for  the  next  major 
advance  in  computing  speed  [45].  However,  several  difficult  problems  need  to  be 
solved.  Two  of  these  problems  are  addressed  in  this  dissertation.  The  first  problem 
is  the  design  of  execution  models,  or  interpreters,  that  allow  desirable  types  of 
parallelism  to  be  exploited  for  certain  types  of  computations.  The  second  problem 
is  the  design  of  a  resource  allocator  to  map  the  parallel  computation  to  hardware 
resources  for  processing,  storage,  and  communication. 

Optimal allocation is ruled out as a viable option because even simplistic computation
and multiprocessor models make the problem NP-complete or worse [43].
A practical strategy must have the following characteristics: (1) hard limits on
resources must be observed, (2) trade-offs must be made among the three types of
hardware resources for processing, storage, and communication, and (3) the algorithms
used for accomplishing resource allocation must themselves be reasonably
efficient.

The type of computation being considered in this thesis is backward-chaining
deduction [6]. This is the type of deduction employed in most extant logic programming
languages. Prolog is a prime example.


Logical deduction is particularly attractive as a starting point for exploiting
parallelism because (1) it has a well understood semantics that is completely
independent of any computer architecture, be it sequential or parallel, and (2) it is not
necessary for the programmer to be burdened with explicitly specifying the
parallelism or for the interpreter/compiler to use complex techniques to uncover the
parallelism. These two advantages together imply that the programmer can
program approximately as he would with a sequential computer. We say approximately
because optional pragmas (or hints) may be given by the programmer to increase
the efficiency of parallel execution. This is analogous to the situation in which the
programmer may do some explicit memory allocation/deallocation and leave most
of the memory reclamation task to the garbage collector.

The  rest  of  this  chapter  is  organized  as  follows.  Section  1.2  describes  backward- 
chaining  deduction.  The  next  two  sections  describe  the  types  of  parallelism  that  can 
be  exploited  for  this  computation  (section  1.3)  and  what  is  necessary  to  describe  a 
parallel execution model (section 1.4). Section 1.5 describes the class of multiprocessors
considered in this thesis and gives some background on FAIM-1, the specific
multiprocessor  that  was  used  for  experimentation.  Section  1.6  gives  the  definition 
of  the  allocation  problem  and  some  background  on  previous  allocation  research. 
Section  1.7  describes  the  overall  structure  of  the  allocator  that  is  described  in  detail 
later  in  the  thesis.  Finally,  the  last  section  presents  the  overall  structure  of  the  rest 
of  the  chapters  in  this  thesis. 


1.2 Backward-Chaining Deduction

Backward-chaining  [6]  is  an  inference  mechanism  for  automated  deduction.  It  is 
used  here  in  the  context  of  a  database  of  Horn  clauses.  An  example  of  a  Horn 
clause  is: 

H ← T1, T2, ..., Tn

where H and all Ti are positive literals (i.e., relation symbols with a list of terms).
A term is a constant, a variable, or a function of some terms. All variable names
will start with an uppercase letter. All constants, function symbols, and predicate
symbols will start with a lowercase letter. H is also called the head of the rule and
the set T1, T2, ..., Tn is called the tail of the rule. The meaning of the above Horn
clause is:

H is true if all of T1, T2, ..., and Tn are true.

By definition, if H is non-null, the clause is called an assertion (or fact) when n
is zero and it is called a rule when n is greater than 0. If H is null, the clause is
called a goal. For example,

← G1, G2, ..., Gm

is a goal with m literals. The meaning of this Horn clause is that the conjunction
given below needs to be solved.

G1 ∧ G2 ∧ ... ∧ Gm

Solving  a  goal  means  proving  it  true  or  false  (in  the  sense  of  logical  implication). 
If  the  goal  is  true,  then  values  of  the  variables  in  the  goal  that  make  it  true  must 
be  given.  The  value  of  a  variable  is  called  a  binding.  A  set  of  values  for  a  set  of 
variables  is  called  a  substitution. 
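The definitions above can be made concrete with a small data-structure sketch. It is only an illustration: the class names `Literal` and `Clause` and the helper `is_variable` are hypothetical, and terms are restricted to constants and variables (no function terms) to keep the example short.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Literal:
    """A positive literal: a relation symbol applied to a tuple of terms."""
    relation: str
    args: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Clause:
    """A Horn clause; head is None for a goal, tail is empty for an assertion."""
    head: Optional[Literal]
    tail: Tuple[Literal, ...] = ()

    def kind(self) -> str:
        if self.head is None:
            return "goal"
        return "assertion" if not self.tail else "rule"


def is_variable(term: str) -> bool:
    # Convention from the text: variable names start with an uppercase letter.
    return term[:1].isupper()


# Hypothetical clauses, one of each kind.
fact = Clause(Literal("p", ("a",)))
rule = Clause(Literal("r", ("X", "Y")),
              (Literal("p", ("X",)), Literal("q", ("Y",))))
goal = Clause(None, (Literal("r", ("X", "Y")),))

# A substitution is a set of bindings, here one value per variable.
substitution = {"X": "a", "Y": "b"}
```

Representing a substitution as a dictionary from variable names to values mirrors the definition of a binding given above.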

An and-or tree is a problem reduction representation [7] used to represent the
problem of proving a goal by backward-chaining. Figure 1 shows an example of
a syntactic and-or tree used to represent a backward-chaining deduction. In this
figure, ovals denote or-nodes and boxes denote and-nodes. And-nodes get their
name because the goal they represent is one conjunct in a conjunctive goal set.
Similarly, or-nodes represent a disjunct in a disjunctive goal set. Nilsson [48] gives
a more formal characterization of and-or trees. Arcs are marked with the number
of the clause used for the reduction. Also, a cut through the arcs going from a node
to its children indicates that the children are and-nodes. The leaf nodes cannot
be reduced. Leaf nodes may be empty boxes. These denote empty goals (i.e.,
successes). All leaf nodes in the example are empty goals. In other cases, a non-empty
leaf indicates a failure. A logical inference is defined as the reduction of a goal
by a rule. In this example, substitutions that make the goal true are {X=a,Y=b}
and {X=b,Y=a}. The discerning reader will notice that the former substitution
can be obtained in two different ways (of proving the top level goal).

We  call  the  tree  syntactic  because  certain  subtrees  may  be  instantiated  multiple 
times  during  an  actual  execution.  For  example,  if  conjuncts  are  solved  left  to  right, 
multiple  solutions  to  “p(X)”  will  lead  to  as  many  instantiations  of  the  subtree  rooted 
at  “q(Y)”. 

Some  of  the  leaf  nodes  in  the  and-or  tree  end  in  failure  and  others  end  in  success. 
The  purpose  of  the  backward-chaining  inference  procedure  is  to  find  either  one  or  all 
nodes  associated  with  success  in  the  and-or  tree.  Each  node  represents  a  solution. 
Therefore,  the  computation  is  a  search  problem.  In  this  thesis,  we  restrict  our 
attention  to  the  case  in  which  all  solutions  are  desired. 

The  most  widely  used  sequential  interpretation  is  the  one  used  by  Prolog.  The 
search through the tree is a depth-first, left-to-right search. Search is suspended
for  solutions  to  a  subgoal  when  one  solution  is  found.  Search  continues  for  the 
next  solution  by  chronological  backtracking  from  the  next  conjunct.  When  the 
first  answer  is  obtained  to  the  top  level  goal,  it  is  announced.  If  more  solutions  are 
demanded,  the  search  continues.  Parallel  approaches  to  interpretation  are  discussed 
in  the  next  chapter.  In  particular,  a  parallel  execution  model  called  PM  is  described. 
PM  exploits  more  types  of  parallelism  than  other  execution  models  that  use  data- 
driven  control  and  non-shared  memory  multiprocessor  architectures. 
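The sequential interpretation described above can be sketched in a few lines of Python. This is a minimal illustration, not the interpreter used in the thesis: literals are tuples, the unifier omits the occurs check, and the database `db` is hypothetical (it was chosen only so that the answers agree with the substitutions mentioned earlier, and is not claimed to be the database behind figure 1). The generator yields solutions one at a time, which loosely mirrors announcing the first answer and continuing on demand.

```python
import itertools

_fresh = itertools.count()  # supplies a new suffix for every clause use


def is_var(t):
    return isinstance(t, str) and t[:1].isupper()


def walk(t, s):
    """Follow variable bindings in substitution s."""
    while is_var(t) and t in s:
        t = s[t]
    return t


def unify(a, b, s):
    """Return s extended so that a and b are equal, or None on failure."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return {**s, a: b}
    if is_var(b):
        return {**s, b: a}
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b):
        for x, y in zip(a, b):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None


def rename(t, n):
    """Give the variables of a clause fresh names for this use."""
    if is_var(t):
        return f"{t}_{n}"
    if isinstance(t, tuple):
        return tuple(rename(x, n) for x in t)
    return t


def solve(goals, db, s):
    """Depth-first, left-to-right search; yields every solution."""
    if not goals:
        yield s
        return
    first, rest = goals[0], goals[1:]
    for head, tail in db:  # clauses are tried in database order
        n = next(_fresh)
        s1 = unify(first, rename(head, n), s)
        if s1 is not None:
            yield from solve([rename(t, n) for t in tail] + rest, db, s1)


def resolve(t, s):
    t = walk(t, s)
    return tuple(resolve(x, s) for x in t) if isinstance(t, tuple) else t


# Hypothetical database: two rules for r(X,Y) and six ground facts.
db = [
    (("r", "X", "Y"), [("p", "X"), ("q", "Y")]),
    (("r", "X", "Y"), [("m", "X"), ("n", "X", "Y")]),
    (("p", "a"), []), (("q", "b"), []),
    (("m", "a"), []), (("m", "b"), []),
    (("n", "a", "b"), []), (("n", "b", "a"), []),
]
goal = ("r", "X", "Y")
answers = [resolve(goal, s) for s in solve([goal], db, {})]
```

With this database, `answers` contains r(a,b) twice (once per rule) and r(b,a) once, so the set of solutions has two members even though three proofs exist.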

The  computation  studied  in  this  thesis  is  very  similar  to  Prolog  but  not  identical. 
In  particular,  a  couple  of  features  that  are  part  of  Prolog  are  not  allowed  here.  First, 
Prolog programs can change the database of Horn clauses. Side-effects of this type
are not allowed in this thesis. Second, Prolog programs allow “cuts”, a construct
used to prune part of the search space. “Cuts” are not allowed in this thesis. The
allocation  strategy  imposes  additional  restrictions  on  the  computation  as  will  be 
seen later.

Figure 1: A Syntactic And-Or Tree

1.3 Types of Parallelism

Several  types  of  parallelism  have  been  described  in  the  literature.  The  list  below 
may  not  be  exhaustive  but  covers  the  well-known  types. 

1.  Or-parallelism:  This  is  the  solution  of  multiple  or-nodes  in  parallel.  There 
is  some  disagreement  in  the  literature  about  the  exact  meaning  of  this.  The 
most  commonly  used  meaning  [41],  and  the  one  used  in  this  thesis,  is  that 
the  entire  search  trees  rooted  at  the  or-nodes  can  be  searched  in  parallel.  In 
figure  1,  the  two  sub-trees  rooted  at  the  two  children  or-nodes  of  the  and-node 
“r(X,Y)”  can  be  searched  in  parallel  using  or-parallelism. 

Conery  [17,16]  uses  a  slightly  different  meaning  of  or-parallelism.  He  defines 
or-parallelism  as  the  assignment  of  a  process  to  each  or-node.  Presumably, 
this  meaning  is  neutral  about  the  parallel  search  of  the  rest  of  the  sub-trees 
below  the  or-nodes. 

2.  And-parallelism:  This  is  the  parallel  solution  of  sibling  and-nodes.  Note  that 
this  does  not  mean  that  the  and-nodes  must  be  solved  in  isolation  from  each 
other  or  that  they  must  all  be  solved  in  parallel.  In  figure  1,  the  and-nodes 
“p(X)”  and  “q(Y)”  may  be  solved  in  parallel  using  and-parallelism. 

3.  Pipelining:  This  is  the  continuous  streaming  of  complete  solutions  from  one 
and-node  to  another.  This  is  useful  when  two  and-nodes  must  be  solved  in 
sequence.  For  example,  pipelining  allows  the  first  solution  of  a  source  and- 
node  to  be  sent  to  a  destination  and-node  and  allows  the  parallel  search  for 
(1)  the  first  solution  of  the  destination  and-node  and  (2)  the  second  solution 
of  the  source  and-node.  In  figure  1,  it  may  be  the  case  that  the  and-nodes 
“m(X)”  and  “n(X,Y)”  are  solved  in  sequence.  Using  pipelining,  solutions  of 
“m(X)” can be streamed continuously to “n(X,Y)”. The search for consistent
solutions  for  “n(X,Y)”  can  begin  as  soon  as  a  solution  of  “m(X)”  is  received. 

4. Search-parallelism: This is the parallel reduction of an and-node to its children
or-nodes. The term “search” refers to the search for clauses whose heads unify
with the and-node. The actual solution of the or-nodes in parallel is called
or-parallelism (as defined above). In figure 1, the literal “r(X,Y)” can be unified
with the heads of the two relevant rules in parallel.

5. Stream-parallelism: Conery [17] defines this as the “eager evaluation of structured
data, which can be treated as a stream”. Conery cites the example of
testing  for  membership  in  a  list  while  the  list  is  still  being  constructed.  There 
is  no  example  of  this  in  figure  1  and  this  type  of  parallelism  is  not  considered 
in  this  thesis.  Examples  of  this  can  be  found  in  the  work  of  Shapiro  [57] 
among  others. 

6. Unification-parallelism: This is the parallelism associated with the unification
of two literals. It has been shown that this problem is inherently
non-parallelizable [20,74] (since it falls outside the problem class NC unless NC =
FP). In attempting to exploit unification-parallelism, the hope is that practical
cases of unification-parallelism are more amenable to speedup. Again,
this type of parallelism is not considered in this thesis. Examples of this can
be found in the work of Citrin [13] and Robinson [53] among others.
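Of the types listed above, pipelining is perhaps the easiest to picture as code. The sketch below is only an illustration under stated assumptions: the fact lists `M_FACTS` and `N_FACTS` are hypothetical, and Python threads are used to show the structure of a source and a destination connected by a stream, not to demonstrate real speedup.

```python
import queue
import threading

M_FACTS = ["a", "b"]                 # hypothetical solutions of m(X)
N_FACTS = [("a", "b"), ("b", "a")]   # hypothetical facts for n(X,Y)
DONE = object()                      # end-of-stream marker


def source(out):
    """Solve m(X) and stream each solution as soon as it is found."""
    for x in M_FACTS:
        out.put(x)
    out.put(DONE)


def destination(inp, results):
    """Begin solving n(X,Y) for each X as soon as that X arrives."""
    while True:
        x = inp.get()
        if x is DONE:
            break
        results.extend((x, y) for (x2, y) in N_FACTS if x2 == x)


def run_pipeline():
    stream, results = queue.Queue(), []
    workers = [
        threading.Thread(target=source, args=(stream,)),
        threading.Thread(target=destination, args=(stream, results)),
    ]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return results
```

The destination does not wait for the source to finish: while it works on the first solution of m(X), the source is free to look for the second one.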

1.4  Parallel  Execution  Models 

A  Parallel  Execution  Model  for  a  sequential  program  and  a  multiprocessor  contains 
the  specification  of  (1)  methods  to  generate  a  set  of  parallel  processes,  (2)  the  state, 
procedures,  and  inter-process  communication  for  the  set  of  processes,  and  (3)  any 
constraints  placed  on  how  the  set  of  processes  must  be  run  on  the  processors  in  the 
multiprocessor. 

The  Parallel  Execution  Model  is  correct  iff  it  produces  the  same  solutions  as  the 
sequential  program. 

Same  can  mean  the  same  set  of  solutions  or  the  same  ordered  set  of  solutions. 
In  this  thesis,  we  use  the  former  meaning  (i.e.,  the  order  in  which  the  solutions  are 
produced is not considered significant).

For example, the set of parallel processes shown in figure 2 might be able to
perform  the  backward-chaining  associated  with  the  and-or  tree  shown  in  figure  1. 
The  arrows  in  the  figure  show  communication  of  data  or  control.  As  the  figure  also 
shows,  the  state,  procedures,  and  messages  associated  with  process  “s(X,Y)”,  as 
well  as  all  other  processes,  must  be  specified.  In  our  case,  the  parallel  execution 
model  is  said  to  be  correct  iff  the  set  of  solutions  produced  by  it  is  equal  to  the  set 
of  solutions  produced  by  the  Prolog  interpreter  as  described  in  section  1.2. 

A  parallel  execution  model  needs  to  exploit  as  much  parallelism  as  possible  while 
not  being  too  complicated  or  expensive  (in  time  and  space)  to  be  practical.  These 
two  requirements  are  clearly  inconsistent,  in  general,  and  a  reasonable  tradeoff  must 
be  made. 

A dataflow representation of the computation is desirable for exploiting concurrency.
There are at least two important reasons. First, a dataflow representation
of a computation makes all its parallelism explicit. Second, it has been argued
convincingly that reasoning about dataflow programs for purposes of proving
correctness properties and allocation is easier than reasoning about other procedural
representations [5,11].

Although a dataflow representation is desirable, it is not so at any cost. For
example,  Fortran  programs  may  be  reformulated  as  dataflow  programs  but  at  the 
cost  of  extensive  copying  of  structures.  The  same  argument  holds  for  any  other 
procedural  representation  that  allows  modification  of  global  state.  Fortunately,  for 
logic  programs,  it  has  been  shown  that  they  can  be  represented  easily  as  dataflow 
programs  (with  indeterminate  merge)  if  the  types  of  parallelism  to  be  exploited 
are  or-parallelism  and  pipelining  only  (see  work  by  Ciepielewski  and  Haridi  [12], 
Lindstrom  and  Panangaden  [41],  and  Singh  and  Genesereth  [61]).  Conery  [15]  has 
shown  how  to  exploit  or-parallelism  and  a  restricted  form  of  and-parallelism,  but 
not  pipelining.  However,  the  control  mechanism  was  not  data-driven  in  nature,  but 
was a variant on the sequential backtracking mechanism of Prolog. PM, the parallel
execution model presented in this thesis, shows how to exploit all three types of
parallelism (or-parallelism, pipelining, and the same restricted form of and-parallelism
described by Conery) while still using a data-driven solution. However, one more
extension, local state, had to be made to dataflow (other than indeterminate merge).
Local state makes the programs harder to reason about but the hope is that the
reasoning is still far easier than it is for arbitrary procedural representations with
global state (like Fortran). The resource allocation algorithms described in this
thesis illustrate this ease of reasoning to some extent.

Figure 2: A Parallel Execution Model

On a different note, an important design consideration for the parallel execution
model came from the target multiprocessor class. As mentioned before, a single
processor may not have enough memory to store the entire program. Parallel execution
models like the Variable Supply Model [61] that require a complete copy at
each processor are therefore disallowed.


1.5  Target  Multiprocessor  Class 

The target class of multiprocessors for this dissertation satisfies the following
properties:

• There is an arbitrary, finite number of identical MIMD (multiple instruction
stream, multiple data stream) [22] processors. No assumption is made about
the speed of these processors.

• Each processor has a finite amount of local memory; there is no global (or
shared) memory. No assumption is made about the memory size except that
the  entire  database  must  fit  in  the  collection  of  memories  of  the  processors 
in  the  system.  The  database  is  distributed  over  the  processors.  Parts  of  the 
database  may  be  replicated. 

• Processors are connected with some interconnection topology. They can communicate
only by sending messages to each other.

• Message delay is some function of the amount of data in the message and
the distance between source and destination. In general, if the source and
destination are not identical, there will be some non-zero delay.

• Each processor can perform backward-chaining deductions based on the subset
of the database that it contains.

An architecture that satisfies the multiprocessor scenario described above is
FAIM-1 [18,68].¹ Quoting from one of the papers, the FAIM-1 architecture is
claimed to be “consistent with high performance VLSI implementation and packaging
technology, and is easily extended to include arbitrary numbers of processors”.
Another architecture that would fit the requirements is the Cosmic Cube [56].

Multiprocessors that do not fall in this class are the Encore Multimax [46] and
the Connection Machine [35]: the Multimax because it is a shared-memory machine
and the Connection Machine because it is a SIMD (single instruction stream,
multiple data stream) machine. However, it may be possible to make shared-memory
multiprocessors like the Encore Multimax [46] behave like message-passing
multiprocessors by making appropriate changes to the operating systems.

All  the  experiments  described  in  this  dissertation  were  done  using  a  simulation 
of  the  FAIM-1  multiprocessor.  At  the  level  of  abstraction  used  in  the  simulation, 
the  multiprocessor  is  composed  of  a  variable  number  of  homogeneously  replicated 
processing  elements  connected  together  with  a  3-axis  variant  of  a  twisted-torus.  A 
processing  element  is  a  processor  with  its  own  local  memory.  A  19  processor  version 
would  have  the  topology  shown  in  figure  3.  The  topology  is  called  an  E-3  surface 
because  there  are  3  processing  elements  on  each  hexagonal  edge.  For  the  sake  of 
simplicity,  wrap-around  connections  for  just  one  axis  are  shown.  In  the  complete 
topology,  two  extra  wires  are  connected  to  each  processing  element  on  the  edge. 
Each  processor  ends  up  having  6  connections  to  its  neighbors  and  a  completely 
identical  topological  view  of  the  rest  of  the  processors.  Quoting  from  the  paper 
by  Stevens  [68],  “this  folding  scheme  results  in  ...  a  provably  minimal  diameter 
for hexagonal meshes.” Another good feature of this topology is its scalability.
The number of processing elements on a surface is given by 3E(E - 1) + 1, where E
represents the E-size, or the number of processing elements on each edge. Therefore,
the numbers of processors on different sizes of surfaces can be 1, 7, 19, 37, 61, and so
on.

¹ We are assuming, of course, that each processor will have the appropriate software to do
backward-chaining deductions.

Figure 3: E-3 Processing Surface for FAIM-1

The  FAIM-1  multiprocessor  has  not  been  built  yet  but  some  rough  estimates 
of  its  expected  performance  and  configuration  are  given  below.  Each  processor  is 
medium-grained,  larger  than  a  Connection  Machine  [35]  processor  but  smaller  than 
a  Symbolics  3600  workstation  [44].  Each  processor  in  the  FAIM-1  multiprocessor 
is  expected  to  perform  at  20  KLIPS  (1  KLIPS  =  1  thousand  logical  inferences  per 
second). Each processor will contain approximately 5 megabytes of memory distributed
over several specialized memory types. Communication delay is expected
to  be  (2  +  2n  +  d)  microseconds,  where  n  is  the  number  of  packets  in  the  message 
and  d  is  the  distance  in  hops  from  the  source  of  the  message  to  its  destination.  The 
packet  size  is  8  words  and  a  word  is  24  bits  wide. 
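The figures quoted above are easy to turn into a small calculator. The sketch below assumes the surface-size formula 3E(E - 1) + 1, which reproduces the sequence 1, 7, 19, 37, 61, and the delay estimate (2 + 2n + d) microseconds. The function names are hypothetical, and rounding a message up to whole 8-word packets is an assumption rather than something the text states.

```python
import math

WORDS_PER_PACKET = 8                      # packet size quoted in the text
KLIPS_PER_PROCESSOR = 20                  # expected speed quoted in the text
INFERENCE_TIME_US = 1_000_000 / (KLIPS_PER_PROCESSOR * 1000)  # 50 microseconds


def surface_size(e: int) -> int:
    """Processing elements on a hexagonal surface with e elements per edge."""
    return 3 * e * (e - 1) + 1


def message_delay_us(words: int, hops: int) -> int:
    """Expected delay 2 + 2n + d microseconds for n packets over d hops."""
    packets = math.ceil(words / WORDS_PER_PACKET)  # assumption: whole packets
    return 2 + 2 * packets + hops
```

For example, `message_delay_us(20, 3)` gives 11 microseconds (three packets over three hops), roughly a fifth of the 50 microseconds that one logical inference is expected to take at 20 KLIPS.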


1.6  The  Allocation  Problem 

We  will  assume  for  now  that  the  computation  is  represented  by  a  directed,  acyclic 
graph  (or  DAG).  Semantically,  the  graph  is  a  dataflow  graph  with  two  exceptions. 
First,  indeterminate  merges  are  allowed.  Second,  the  nodes  may  have  associated 
local state and may manipulate this local state. However, in keeping with dataflow
semantics,  all  computation  is  data-driven  (i.e.,  triggered  off  at  nodes  by  messages 
received  along  the  arcs).  This  type  of  graph  will  be  called  a  dataflow*  graph  in  this 
thesis.  The  name  indicates  the  similarity  to  dataflow  and  the  “*”  indicates  that 
it  is  slightly  different  from  dataflow.  It  will  be  shown  in  chapter  2  that  PM,  the 
parallel  execution  model,  is  based  on  dataflow*  graphs. 


The  allocation  problem  can  be  defined  precisely  now.  It  is  finding  the  many-to- 
one  mapping  from  the  set  of  nodes  in  the  dataflow*  graph  to  the  set  of  processors 
that  gives  the  minimum  completion  time. 
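To make the definition concrete, the sketch below evaluates the completion time of one mapping and then searches all mappings exhaustively. It rests on simplifying assumptions that are not taken from the thesis: each node fires exactly once, a processor runs its nodes one at a time in the given topological order, and every arc between different processors costs the same fixed delay. The graph, costs, and function names are hypothetical.

```python
from itertools import product


def completion_time(nodes, arcs, cost, mapping, delay):
    """Finish time of the last node under a simple list-scheduling model.

    nodes must be in topological order; arcs is a list of (src, dst) pairs.
    """
    finish, free_at = {}, {}
    for n in nodes:
        ready = 0
        for src, dst in arcs:
            if dst == n:
                hop = 0 if mapping[src] == mapping[n] else delay
                ready = max(ready, finish[src] + hop)
        start = max(ready, free_at.get(mapping[n], 0))
        finish[n] = start + cost[n]
        free_at[mapping[n]] = finish[n]
    return max(finish.values())


def best_mapping(nodes, arcs, cost, processors, delay):
    """Try all len(processors) ** len(nodes) many-to-one mappings."""
    best = None
    for choice in product(processors, repeat=len(nodes)):
        mapping = dict(zip(nodes, choice))
        t = completion_time(nodes, arcs, cost, mapping, delay)
        if best is None or t < best[0]:
            best = (t, mapping)
    return best


# Hypothetical diamond-shaped graph: a feeds b and c, which both feed d.
NODES = ["a", "b", "c", "d"]
ARCS = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
COST = {"a": 1, "b": 4, "c": 4, "d": 1}

one_processor = completion_time(NODES, ARCS, COST, dict.fromkeys(NODES, 0), 1)
best_time, best_map = best_mapping(NODES, ARCS, COST, [0, 1], delay=1)
```

Even in this toy model the number of candidate mappings grows as the number of processors raised to the number of nodes, which is why exhaustive search is only usable for very small graphs.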


Since the precedence constraints associated with the computation DAG can
be arbitrary (as can be seen later in chapter 2), this allocation problem is NP-hard:
a known NP-complete problem, namely Precedence Constrained
Scheduling [27], is a special case (in which communication delays are assumed to
be zero). It turns out that the problem remains NP-complete even for more
structured computations [43]. In any case, the implication for this thesis is that
finding the optimal solution is impractical. Therefore, the allocation strategy
suggested by this thesis is sub-optimal. However, the allocation algorithms used are
shown to be polynomial-time in their worst case complexity. Yet, the allocations
generated are found to exploit much of the parallelism present in the logic programs.


In  chapter  2,  it  will  be  seen  that  each  node  in  the  dataflow*  graph  is  associated 
with  a  certain  subset  of  the  database,  where  the  set  of  subsets  is  mutually  exclusive 
and  exhaustive.  We  will  use  the  term  partition  for  each  of  these  subsets  although 
this  use  of  the  term  is  a  bit  non-standard.  Instead  of  thinking  of  the  allocation 
in  terms  of  mapping  nodes  of  the  dataflow*  graph  to  processors,  we  can  think  of 
it  as  mapping  partitions  of  the  database  to  processors.  Some  partitions  may  be 
replicated  for  additional  parallelism.  Therefore,  the  mapping  of  database  partitions 
to processors will be many-to-many in general.

1.7 Allocation Strategy


The allocation strategy described in this thesis is a compile-time (or static) allocation
strategy. In other words, the compiler makes the decisions involved in mapping
tasks to processors. This strategy is in contrast to (1) run-time (or dynamic)
allocation, in which the run-time or operating system performs the allocation or
re-allocation, or (2) programmed (or user-defined) allocation, in which the user specifies
the allocation. Compile-time allocation is not expected to be the best solution for
all applications but it does compare favorably to the other two types in some ways.
The disadvantage of run-time allocation is that the overhead is paid at run-time
and it may be unacceptable. However, if the program behavior is highly dynamic
and is hard to predict at compile-time, this may be the best approach. The
disadvantage of programmed allocation is that it places a big burden on the user and
the allocation probably has to be repeated for every new machine architecture. The
advantage, of course, is that the user may know much more about his program and
how to allocate it than an automatic allocator. Of course, features of all three types
of allocation may be combined. Given that so little is known about practical allocation
strategies, and almost nothing about hybrid strategies, this thesis concentrates
on pure compile-time allocation. For logic programming, in particular, I am not
aware of any work on compile-time allocation so far.

The  (possibly)  limited  memory  size  of  a  processor  affects  the  resource  allocation 
strategy  also.  Allocation  strategies  like  the  one  described  by  Sarkar  [55],  which 
depend  on  each  processor  being  able  to  execute  the  entire  program,  are  unacceptable 
here. 

The  allocation  strategy  described  in  this  thesis  needs  some  restrictions  that 
PM  does  not  require.  First,  the  type  of  backward-chaining  deduction  is  restricted. 
In  particular,  no  recursive  clauses  are  allowed,  unit  clauses  must  be  ground,  and 
certain  probabilistic  uniformity  and  independence  assumptions  must  apply.  Second, 
a  partitioning  of  the  database  is  assumed  to  be  given. 

Figure 4 gives a high-level view of the allocator strategy. There are two main
modules. One module, called the allocator module, performs the search for a suitable
allocation. Of course, since the search space is exponential, only a small part
of it can be explored. The other module, called the cost computation module, computes
the cost of a particular allocation being considered. Cost is a number that
captures the relative poorness of an allocation.

The cost function is formally defined and domain-independent (or
application-independent).² All the domain-dependent information required is given in the input
Goal and Domain sizes and will be described in more detail in chapter 3. Also, the
cost function does not apply just to a specific multiprocessor. The multiprocessor
description is one of the inputs of the cost computation module. Again, more detail
is given in chapter 3. The cost function has two other important attributes. First,
the cost metric correlates well with intuitive notions of the
relative poorness of allocations. This intuition is justified by experimental results
obtained from an implementation of the allocator. Second, the algorithms to compute
this cost function have polynomial-time worst-case complexity in the size of the
computation. An exponential-time complexity would be considered unacceptable.

The allocator module consists of two phases: (1) a greedy allocation phase and
(2) a local minimization phase. Let us assume for now that each partition of the
database is allocated to a single processor. The greedy allocation phase allocates the
partitions of the database one at a time, allocating the latest partition to the least
cost processor without re-allocating previously allocated partitions. This phase has
polynomial-time worst case complexity. This is followed by the local-minimization
phase. In this phase, partitions of the database may be re-allocated to neighboring
processors if that reduces the cost. Let a round consist of a (possible) single
re-allocation of each part of the program. Each round has polynomial-time worst
case complexity. Obtaining a local minimum of the cost-function may take an
exponential number of rounds, however. Fortunately, it turns out that the greedy
allocation phase alone, or greedy allocation combined with a limited number of
rounds of the local minimization phase, produces very reasonable allocations.

² As used here, the term domain-independence means that the definition of the cost-function and
the algorithms to compute it are the same regardless of the domain. However, certain inputs to
the cost-function and the associated procedures may depend on the domain of interest. Smith [65]
prefers to call this semi-independence, saving the use of independence for cases where absolutely no
domain dependent information is used.

Figure 4: Allocator Strategy
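The two phases of the allocator module can be sketched as follows. This is a minimal illustration, not the algorithm of chapter 4: the cost function is passed in as a parameter, and `toy_cost`, `WORK`, `LINKS`, and `NEIGHBORS` are hypothetical stand-ins (largest processor load plus one unit per pair of linked partitions placed on different processors) for the cost function defined in chapter 3.

```python
def greedy_allocate(partitions, processors, cost):
    """Place partitions one at a time on the cheapest processor, never revisiting."""
    allocation = {}
    for part in partitions:
        allocation[part] = min(
            processors, key=lambda p: cost({**allocation, part: p}))
    return allocation


def local_minimize(allocation, neighbors, cost, max_rounds):
    """Move partitions to neighboring processors while that lowers the cost."""
    allocation = dict(allocation)
    for _ in range(max_rounds):
        improved = False
        for part in list(allocation):
            current = cost(allocation)
            for q in neighbors[allocation[part]]:
                trial = {**allocation, part: q}
                if cost(trial) < current:
                    allocation, improved = trial, True
                    break  # at most one re-allocation per partition per round
        if not improved:
            break  # no single move helps: a local minimum
    return allocation


# Hypothetical inputs for a three-processor ring.
WORK = {"p1": 3, "p2": 3, "p3": 2, "p4": 2}
LINKS = [("p1", "p2"), ("p3", "p4")]
NEIGHBORS = {0: [1, 2], 1: [0, 2], 2: [0, 1]}


def toy_cost(allocation):
    load = {}
    for part, proc in allocation.items():
        load[proc] = load.get(proc, 0) + WORK[part]
    comm = sum(1 for a, b in LINKS
               if a in allocation and b in allocation
               and allocation[a] != allocation[b])
    return max(load.values()) + comm


greedy = greedy_allocate(list(WORK), [0, 1, 2], toy_cost)
final = local_minimize(greedy, NEIGHBORS, toy_cost, max_rounds=3)
```

Capping the number of rounds with `max_rounds` reflects the observation above that a true local minimum may take too many rounds to reach, while a few rounds are usually enough in practice.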


1.8  Organization  of  Document 

Chapter  2  describes  PM,  the  parallel  execution  model.  Chapter  3  describes  the 
cost-function  that  is  the  basis  of  the  allocator.  The  chapter  includes  descriptions  of 
algorithms  to  compute  the  cost-function  and  re-compute  it  for  small  changes  in  the 
allocation. Chapter 4 describes the algorithms for allocation. The chapter includes
results  obtained  from  implementations  of  PM  and  the  allocator.  Finally,  chapter  5 
presents a summary of the key ideas in this thesis and directions for future research.

PM: A Parallel Execution Model


2.1  Introduction 

The  parallel  execution  model  described  in  this  chapter  is  called  PM.  It  is  designed 
to  exploit  parallelism  for  backward-chaining  deduction.  In  addition,  PM  is  designed 
for  a  class  of  multiprocessors  that  includes  non-shared  memory  among  other  features 
(see chapter 1 for more details). Side-effects to the database of facts and rules are
not allowed during the computation in PM.

A  key  feature  of  PM  is  that  all  control  of  execution  is  based  on  what  we 
call dataflow* graphs. These are dataflow graphs [70] augmented with two non-dataflow
features: indeterminate merge and local state. Dataflow* carries with it
the  dataflow  advantage  of  decentralized  control.  No  synchronization  is  required 
other  than  the  flow  of  data. 

Several important types of parallelism have been identified for backward-chaining
deductions [15,57]. The three that are exploited by PM are and-parallelism,
or-parallelism, and pipelining. Or-parallelism is the simultaneous exploration of
multiple paths to solving a single goal. And-parallelism is the simultaneous solution
of multiple parts of a conjunctive goal. Pipelining also applies to the solution of
constituent conjuncts in a conjunctive goal. It is the continuous streaming of solutions
between a pair (or more) of conjunct solvers in sequence. Just as in pipelined
computer architectures, pipelining can improve the throughput of processing.

Unrestricted  and-parallelism  is  usually  not  exploited  because  of  its  wasteful, 
combinatoric  explosion.  Various  researchers  have  considered  different  methods  of 
restricting  and-parallelism  [57,15,19,61,41].  The  and-parallelism  exploited  by  PM 
is  of  the  type  described  by  Conery  [15],  where  conjunctive  goals  are  not  solved  in 
parallel  if  they  share  variables. 

Conery’s execution model exploited a combination of or-parallelism and and-parallelism
[15]. Lindstrom et al. [41] and I [61] used a combination of or-parallelism
and pipelining. PM is unique in exploiting all three together for the class of
architectures described above while still using data-driven control.

Resource allocation techniques are needed to determine (1) the distribution of
the database over the processors and (2) the processor to use in the case of replication
of certain parts of the database. Clearly, this will strongly affect the efficiency
of backward-chaining deductions. Chapters 3 and 4 will describe a specific resource
allocation strategy for PM.

This chapter is organized as follows. First, the general approach towards exploiting
parallelism is described in section 2.2. Next, PM, the parallel execution
model advocated by this chapter, is described in section 2.3. This section begins
with an abstract description of PM, along with a proof of correctness, before plunging
into some algorithmic details. Section 2.4 presents some extensions to the basic
execution model. Finally, section 2.5 discusses some related work done by others.


2.2  The  Approach 

Section 1.2 described the standard sequential approach to backward-chaining deduction.
This section describes how the sequential execution model may be changed
to exploit parallelism.

Many different parallel interpretations of the and-or tree are possible. One could,
of course, do everything in parallel. All or-nodes that are the children of an and-node
can be solved in parallel (or-parallelism) and all and-nodes that are the children of
an or-node can be solved in parallel (and-parallelism). For and-parallelism, this
would mean running a process for each of the conjuncts in parallel. This would
generate many solutions, most of which might fail if there were shared variables in
the conjuncts that must be simultaneously satisfied. Therefore, in general, it is a
good idea to avoid this combinatorial explosion.

The  solution  adopted  here  is  to  exploit  all  the  or-parallelism  but  to  take  a  more 
conservative  position  with  respect  to  and-parallelism.  Only  those  and-nodes  that  do 
not  share  any  common  variables  are  solved  in  parallel.  Assume  for  now  (until  section 
2.4  on  extensions  to  the  basic  execution  model)  that  the  solution  of  an  and-node 
binds  all  the  variables  in  the  associated  literal  to  ground  terms  (i.e.,  terms  with  no 
variables).  Once  and-nodes  bind  certain  variables,  then  other  and-nodes  may  stop 
sharing  unbound  variables  and  those  nodes  can  then  be  solved  in  parallel.  One  can 
think  of  the  and-nodes  as  being  arranged  in  a  directed,  acyclic  graph  (DAG).  Notice 
that  each  application  of  a  rule  in  the  database  produces  one  such  DAG.  There  is 
a  one  to  one  correspondence  between  the  literals  in  the  body  of  the  rule  and  the 
nodes  in  the  DAG.  Two  examples  that  satisfy  the  constraint  described  above  are 
shown  for  the  same  conjunctive  goal  in  figure  5. 

Solutions  from  nodes  flow  to  their  downstream  neighbors  which  can  then  be 
solved  in  parallel.  Solutions  are  sent  in  a  continuous  stream  in  contrast  to  the 
backtracking  control  of  sequential  and  most  parallel  execution  models.  This  is  the 
essence  of  pipelining. 

In general, some possible DAGs for a rule application will be solved more efficiently
than others. In fact, this problem is analogous to ordering conjuncts for
efficient sequential interpretation [64]. This problem is important but is not the
subject of this thesis. In this thesis, a heuristic algorithm selects the DAG at run-time.
The algorithm is described in appendix A. The input to the algorithm is a
total order for a set of conjuncts, just as one would specify in Prolog, for example.
The partial order generated by the algorithm is a minimal subset of this total order
satisfying the constraint that conjuncts sharing unbound variables must be solved
sequentially. Note that the chosen DAG is, in general, different when different sets
of variables get bound at rule application time. In addition, the specific DAG
representation of the partial order is minimal (in the number of edges used). The
complexity of the algorithm is O(n^3), where n is the number of and-nodes.

Figure 5: Example DAGs
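
The constraint on conjuncts that share unbound variables can be illustrated with a small Python sketch. It is not the algorithm of appendix A, whose details are not reproduced here. It assumes, as the text does until section 2.4, that solving a conjunct grounds all of its variables, so that each unbound variable is bound by the first conjunct in the total order that mentions it and every later conjunct mentioning it must wait for that producer. The name select_dag, the representation of conjuncts as (name, variable set) pairs, and the closure-based removal of redundant edges are choices made only for this illustration.

```python
def select_dag(conjuncts, bound):
    """Return a minimal edge list (i, j) meaning conjunct i must precede conjunct j.

    conjuncts: list of (name, set_of_variables) given in the programmer's total order.
    bound:     set of variables already bound at rule application time.
    """
    n = len(conjuncts)
    producer = {}          # variable -> index of the first conjunct that binds it
    edges = set()
    for j, (_, variables) in enumerate(conjuncts):
        for v in variables - bound:
            if v in producer:
                edges.add((producer[v], j))   # j must wait for the producer of v
            else:
                producer[v] = j               # j is the first to mention v

    # Transitive closure (Warshall style), used to detect redundant edges.
    reach = [[(i, j) in edges for j in range(n)] for i in range(n)]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                if reach[i][k] and reach[k][j]:
                    reach[i][j] = True

    # Keep an edge only if no longer path already implies it.
    minimal = {(i, j) for (i, j) in edges
               if not any(reach[i][k] and reach[k][j] for k in range(n))}
    return sorted(minimal)


# Body of a rule b(W,Y), c(Y,U), d(Y,V), e(U,V,Z) with W bound on entry.
body = [("b", {"W", "Y"}), ("c", {"Y", "U"}), ("d", {"Y", "V"}), ("e", {"U", "V", "Z"})]
print(select_dag(body, bound={"W"}))
```

With W bound, c and d both wait only for b and can then run in parallel, and e waits for both. If Y is also bound on entry, the edges out of b disappear, which shows how the chosen DAG depends on the variables bound at rule application time.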

The  database  is  distributed  to  the  processors  in  the  system  according  to  three 
constraints.  First,  the  set  of  clauses  must  be  partitioned  into  mutually  exclusive  and 
exhaustive  subsets  such  that  each  literal  goal  generated  during  backward-chaining 
can  be  reduced  by  a  single  subset.  A  partition  that  satisfies  this  constraint  is 
simply  a  partition  based  on  predicate  symbols  (of  facts  and  consequents  of  rules).  Of 
course,  other  partitions  may  be  possible  as  well.  Second,  each  subset,  in  its  entirety, 
must  be  separately  resident  in  the  memory  of  one  or  more  processors.  Third,  the 
distribution  of  the  database  is  done  completely  before  any  goal  is  presented  to  the 
system.  (There  is  no  reason  why  run-time  distribution  of  the  database  cannot  be 
done.  It  is  just  that  it  is  not  explored  in  this  thesis.) 
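
The first constraint can be made concrete with a short sketch of the partition based on predicate symbols. The clause representation used here (a head tuple whose first element is the predicate symbol, paired with a list of body literals) and the name partition_by_predicate are assumptions made for the illustration; the assignment of subsets to processors is the subject of chapters 3 and 4 and is not shown.

```python
from collections import defaultdict


def partition_by_predicate(clauses):
    """Group clauses by the predicate symbol of their head (fact or rule consequent).

    Each clause is (head, body), where head is a tuple (predicate, arg1, ...)
    and body is a list of such tuples (empty for a fact).
    """
    subsets = defaultdict(list)
    for head, body in clauses:
        subsets[head[0]].append((head, body))
    return dict(subsets)


database = [
    (("parent", "ann", "bob"), []),
    (("parent", "bob", "cat"), []),
    (("grandparent", "X", "Z"), [("parent", "X", "Y"), ("parent", "Y", "Z")]),
]
for predicate, subset in partition_by_predicate(database).items():
    print(predicate, len(subset))
```

Every clause lands in exactly one subset, and any literal goal can be reduced using only the subset for its own predicate symbol, which is what the first constraint requires.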


2.3  Basic  Execution  Model 


The basic execution model deals with a simplified view of the multiprocessor environment
as well as of backward-chaining. The additional complexities are handled
in the extensions to the basic execution model.

The  simplifications  are  as  follows:  (1)  It  is  assumed  that  the  set  of  clauses 
pertinent  to  reducing  any  particular  goal  are  in  a  single  processor.  For  example, 
if  facts  are  partitioned  on  the  basis  of  predicate  symbols,  all  facts  with  a  certain 
predicate  symbol  are  in  a  single  processor.  (2)  It  is  assumed  that  once  the  database 
is  distributed  over  the  multiple  processors,  there  is  no  shortage  of  dynamic  storage 
at individual processors during the computation.¹ (3) Finally, it is assumed that
all solutions to a goal bind all the variables in the goal to ground terms (i.e., terms
not containing any variables).


1 It can be argued that this simplification violates the assumption of limited memory at each
processor. In general, it is impossible to guarantee, even for sequential computations, that the
amount of dynamic memory is sufficient for the given computation. In specific cases, for both
sequential and parallel computations, it may be possible to guarantee that the amount of memory
is sufficient.


2.3.1 Notation and Definitions

Let <E1, E2, ..., En> denote a tuple of elements E1, E2, ..., En.

Let {E1, E2, ..., En} denote a set of elements E1, E2, ..., En.

Bindings of variables are given as Variable1 = term1. Unification of two literals
may result in a substitution given by a set of bindings. For example,

Substitution1 = {V2 = term2, V3 = term3}

The domain of a substitution is defined to be the set of variables whose bindings
are given in the substitution. For example, the domain of the substitution {V2 =
term2, V3 = term3} is {V2, V3}. Similarly, the range of a substitution is defined
to be the set of variables that appear in the bindings of the domain variables. For
example, the range of the substitution {V2 = V3, V4 = V5} is {V3, V5}.
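
The two definitions can be checked with a small sketch. The representation is an assumption made for illustration: a substitution is a Python dict from variable names to terms, variables are strings beginning with an upper-case letter (the Prolog convention), constants are other strings, and compound terms are tuples whose first element is the function symbol.

```python
def is_var(term):
    """A variable is a string starting with an upper-case letter (assumed convention)."""
    return isinstance(term, str) and term[:1].isupper()


def vars_in(term):
    """Collect the variables occurring anywhere in a term."""
    if is_var(term):
        return {term}
    if isinstance(term, tuple):          # compound term: (functor, arg1, arg2, ...)
        found = set()
        for arg in term[1:]:
            found |= vars_in(arg)
        return found
    return set()                         # constant


def domain(subst):
    """Variables whose bindings are given in the substitution."""
    return set(subst)


def range_of(subst):
    """Variables that appear in the bindings of the domain variables."""
    found = set()
    for term in subst.values():
        found |= vars_in(term)
    return found


print(domain({"V2": "term2", "V3": "term3"}))
print(range_of({"V2": "V3", "V4": "V5"}))
```

A substitution whose bindings are all ground terms has an empty range, which is the situation assumed throughout the basic execution model.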

Two substitutions may be composed to produce a single substitution. For any
two substitutions, S1 and S2, Composition(S1, S2) is defined only if the following
two conditions hold: (1) The intersection of the domains of S1 and S2 is the null
set and (2) The intersection of the range of S2 and the domain of S1 is the null
set. In particular, what is allowed is for the domain of S2 to contain some variables
belonging to the range of S1. For example, the two conditions are satisfied for the
following case:

S1 = {X = Y, U = V}


S2 = {Y = P, V = Q}

When the two conditions are satisfied, the Composition function is simply the union
function for sets. In the example, the composition would be {X = Y, U = V, Y =
P, V = Q}. Also, two substitutions, S1 and S2, are equivalent if S2 can be obtained
from S1 by replacing the binding of a variable belonging to S1, Var1 = term1,
by Var1 = term1|binding2, where binding2 belongs to S1 and |binding2 indicates the
application of binding binding2. For example, {X = Y, U = V, Y = P, V = Q} is
equivalent to {X = P, U = Q}.
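
The Composition function and the rewriting step behind equivalence can be sketched as follows. The sketch assumes substitutions are dicts whose bindings are either variables (upper-case strings) or constants, with no compound terms, and that binding chains are acyclic. The names compose and resolve are illustrative. Note that resolve only rewrites right-hand sides; it keeps the intermediate bindings Y = P and V = Q that the example in the text leaves out.

```python
def is_var(term):
    """A variable is a string starting with an upper-case letter (assumed convention)."""
    return isinstance(term, str) and term[:1].isupper()


def range_of(subst):
    """Variables appearing on the right-hand side of a binding (flat terms only)."""
    return {t for t in subst.values() if is_var(t)}


def compose(s1, s2):
    """Composition(S1, S2): set union, defined only under the two stated conditions."""
    if set(s1) & set(s2):
        raise ValueError("condition (1) violated: the domains overlap")
    if range_of(s2) & set(s1):
        raise ValueError("condition (2) violated: range of S2 meets domain of S1")
    return {**s1, **s2}


def resolve(subst):
    """Apply bindings to right-hand sides until no bound variable remains there."""
    out = dict(subst)
    for _ in range(len(out)):            # enough passes for acyclic binding chains
        for var, term in out.items():
            if is_var(term) and term in out:
                out[var] = out[term]
    return out


s1 = {"X": "Y", "U": "V"}
s2 = {"Y": "P", "V": "Q"}
composed = compose(s1, s2)
print(composed)
print(resolve(composed))
```

Swapping the arguments, compose(s2, s1), is rejected by condition (2), because the range of S1 contains Y and V, which are in the domain of S2.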


2.3.2 Behavioral Description

This  section  contains  an  abstract  behavioral  description  of  the  basic  execution 
model.  The  next  section  contains  a  proof  of  correctness  of  this  description.  As 
will  be  pointed  out  later  in  detail,  extra  structure  will  be  added  to  this  description 
to  make  it  more  suitable  for  an  implementation.  It  is  in  this  spirit  that  we  will  treat 
streams  of  messages  as  sets  of  messages  (without  an  ordering)  in  this  section  and  in 
the  next  one. 

The  basic  computation  unit  is  a  sequential  process.  Processes  contain  state  and 
they  are  connected  together  by  communication  channels  (abbreviated  channels). 
Communication  between  processes  takes  place  by  sending  a  set  of  messages  across 
each  channel.  Channels  are  directed.  All  messages  that  are  sent  at  one  end  of  a 
channel  must  arrive  at  the  other  end.  Due  to  the  correspondence  between  processes 
and  channels  with  nodes  and  arcs  respectively  in  a  directed  graph,  the  pairs  of 
terms process/node and channel/arc will be used interchangeably in the rest of this
chapter. 

Parallelism in the basic execution model is achieved by running different processes
in different physical processors. Of course, more than one process may be
mapped to the same processor due to resource constraints and communication requirements.
The details of setting up processes on different processors will be described
in the section on algorithmic details (section 2.3.4).

We  use  the  phrase  behavioral  description  to  denote  a  set  of  functions  that  take 
inputs  and  the  current  state  as  arguments  and  return  outputs  and  a  new  state.  A 
set  of  functions  is  needed  because  different  types  of  incoming  channels  need  different 
functions. 

A  very  high  level  description  is  given  now  for  the  parallel  computation,  with  more 
details  given  in  later  paragraphs.  There  are  three  types  of  processes  (represented 
by  boxes)  and  six  types  of  channels  (represented  by  directed  arcs)  as  shown  in 
figure  6.  All  messages  on  all  channels  consist  of  a  single  substitution  each.  Each 
Normal  process  is  responsible  for  solving  one  literal  for  a  set  of  substitutions  Si. 
Si  is  a  function  (to  be  described  later)  of  the  sets  of  substitutions  that  are  received 
along the Input channels of the Normal process. All solutions of the literal for
the Si substitutions are sent out on each of the Output channels of the Normal
process. These solutions are obtained by the reduction of goals, represented by the
application of substitutions in Si to the literal, by rules or facts. If a rule is used, a
DAG of conjunctive subgoals may be obtained of the type shown in figure 5, each
conjunct being represented by its own Normal process. The Head and Tail processes
shown in figure 6 are used just for the initiation of computation associated with the
DAG and the collection of solutions from the DAG. If a fact is used instead of a
rule, one can just think of the DAG as being empty and the Head and Tail processes
as being directly connected to each other.

Figure 6: Types of Processes and Channels

Other than Input and Output channels, there are Task, Subtask, Solution, and
Subsolution  channels.  A  process  can  have  at  most  one  Task  channel  or  Solution 
channel.  Also,  each  Subtask  channel  has  a  corresponding  Subsolution  channel.  In 
addition,  no  single  process  can  have  all  types  of  channels.  A  Normal  process,  as 
shown  in  figure  7,  can  have  Input,  Output,  Subtask,  and  Subsolution  channels  only. 
A  Head  process  can  have  a  single  Task  channel  and  some  Output  channels  only 
as  shown  in  figure  8.  A  Tail  process  can  have  some  Input  channels  and  a  single 
Solution  channel  only  as  shown  in  figure  9.  Each  channel  has  a  dual  purpose  when 
viewed  from  the  perspective  of  the  two  processes  it  connects.  In  particular,  the  dual 
types have to be one of Input/Output, Task/Subtask, or Solution/Subsolution.

Figure 9: A Tail Process

As mentioned before, the substitutions on the input channels to a normal process
represent  goals  that  the  process  must  solve.  In  particular,  all  input  channels  to  a 
normal  process  are  functionally  equivalent  to  just  one  hypothetical  channel  called 
the  virtual  input  channel.  Each  substitution  in  the  virtual  input  channel,  when 
applied  to  the  literal  associated  with  a  normal  process,  represents  a  goal  that  the 
process  must  solve.  The  set  of  substitutions  in  the  virtual  input  channel  is  obtained 
by  applying  a  function  called  CP  to  the  sets  of  substitutions  in  the  multiple  input 
channels.  Informally,  CP  computes  the  cartesian  product  of  the  sets  of  substitutions 
on  the  input  channels  and  filters  out  inconsistent  combinations  of  substitutions.  The 
need  for  this  filtering  can  be  seen  in  figure  10.  The  binding  of  variable  “X”  in  a 
substitution  along  the  first  input  channel  to  process  “d(X,Y,Z)”  may  be  inconsistent 
with  the  binding  of  “X”  in  a  substitution  along  the  second  input  channel.  This 
combination  should  be  filtered  out. 

The  formal  definition  of  CP  is  given  now  enclosed  by  the  labels  Begin  formal 
definition  of  CP  and  End  formal  definition  of  CP.  Readers  satisfied  with  the 
informal  definition  of  CP  given  above  may  skip  this  detail  safely. 

Begin  formal  definition  of  CP 

The formal definition of CP uses an auxiliary function Merge. The input
to Merge is n substitutions IS1, IS2, ..., ISn. The output is a substitution or
a special element ⊥ that is not a substitution. If there exists some variable V
such that its binding in ISi (1 ≤ i ≤ n) is V = bi and its binding in ISj
(1 ≤ j ≤ n) is V = bj and bi ≠ bj, then Merge(IS1, IS2, ..., ISn) = ⊥. Otherwise,
Merge(IS1, IS2, ..., ISn) = Union(IS1, IS2, ..., ISn). Union is the normal
set union. The element ⊥ is used to indicate that inconsistent bindings of some
variable exist in the substitutions. This is used in the definition of CP to filter
out such combinations of substitutions. A couple of examples of Merge are given
below.

Merge({X = x1, Y = y1}, {X = x1, Z = z1}) = {X = x1, Y = y1, Z = z1}

Merge({X = x1, Y = y1}, {X = x2, Z = z1}) = ⊥

Note that all bindings are to ground terms as assumed earlier.

Figure 10: Filtering Substitutions
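
Merge can be written down directly from its definition. In this sketch a substitution is a Python dict from variable names to ground terms, and the special element ⊥ is represented by None; both choices, and the name merge, are assumptions made for the illustration.

```python
BOTTOM = None   # stands for the special element ⊥, which is not a substitution


def merge(*substitutions):
    """Union of the substitutions, or BOTTOM if some variable has two different bindings."""
    result = {}
    for subst in substitutions:
        for var, term in subst.items():
            if var in result and result[var] != term:
                return BOTTOM            # inconsistent bindings of var
            result[var] = term
    return result


print(merge({"X": "x1", "Y": "y1"}, {"X": "x1", "Z": "z1"}))
print(merge({"X": "x1", "Y": "y1"}, {"X": "x2", "Z": "z1"}))
```

Because the empty substitution {} is a legitimate result, callers should test for inconsistency with "is BOTTOM" and not with a truth test.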

The input to the function CP is n sets of substitutions ISS1, ISS2, ..., ISSn.
The output is a set of substitutions.

CP(ISS1, ISS2, ..., ISSn)

= {Merge(e1, e2, ..., en) | e1 ∈ ISS1, e2 ∈ ISS2, ..., en ∈ ISSn} − {⊥}

“−” is used to denote set difference.

As a specific example,

CP({{X = x1, Y = y1}, {X = x2, Y = y2}}, {{X = x2, Z = z1}, {X = x2, Z = z2}})

= {{X = x2, Y = y2, Z = z1}, {X = x2, Y = y2, Z = z2}}
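
The definition of CP translates almost literally into code. The sketch below assumes substitutions are dicts with ground bindings, represents each set of substitutions as a list (dicts are not hashable), and represents ⊥ by None; the names merge and cp are illustrative.

```python
from itertools import product


def merge(*substitutions):
    """Union of the substitutions, or None (standing for ⊥) on conflicting bindings."""
    result = {}
    for subst in substitutions:
        for var, term in subst.items():
            if var in result and result[var] != term:
                return None
            result[var] = term
    return result


def cp(*substitution_sets):
    """Merge every combination drawn from the input sets and drop the ⊥ results."""
    result = []
    for combination in product(*substitution_sets):
        merged = merge(*combination)
        if merged is not None and merged not in result:   # filter ⊥, keep set semantics
            result.append(merged)
    return result


first = [{"X": "x1", "Y": "y1"}, {"X": "x2", "Y": "y2"}]
second = [{"X": "x2", "Z": "z1"}, {"X": "x2", "Z": "z2"}]
for substitution in cp(first, second):
    print(substitution)
```

The two combinations that pair X = x1 with X = x2 are merged to ⊥ and filtered out, leaving the two substitutions of the example.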

End  formal  definition  of  CP 

We  have  seen  that  the  set  of  substitutions  in  the  virtual  input  channel  is  obtained 
by  applying  the  function  CP  to  the  sets  of  substitutions  on  the  input  channels.  It 
is  in  this  sense  that  a  single  virtual  input  channel  is  equivalent  to  the  multiple 
input  channels  to  a  normal  process.  Therefore,  without  loss  of  generality,  we  can 
complete  the  behavioral  description  of  a  normal  process  assuming  just  one  input 
channel — the  virtual  input  channel. 

Just  as  a  normal  process  can  have  more  than  one  input  channel,  it  can  have 
more  than  one  output  channel.  The  messages  on  all  output  channels  are  identical. 
Therefore,  in  addition  to  assuming  just  one  input  channel,  we  can  assume  just 
one  (virtual)  output  channel  to  complete  the  behavioral  description  without  loss  of 
generality.

Figure 11: Reduction of Goals

As  mentioned  before,  each  substitution  in  the  virtual  input  channel  applied  to 
the  literal  associated  with  a  normal  process  is  an  input  goal  for  the  process  to  solve. 
The  solution  to  each  goal  is  also  represented  as  a  set  of  substitutions.  The  set  of 
substitutions  in  the  output  channel  is  the  union  of  the  sets  of  solutions  of  the  input 
goals. 

First,  consider  the  case  when  the  logic  program  contains  only  assertions  to  solve 
a  particular  input  goal  for  a  normal  process.  In  this  case,  the  goal  can  be  solved 
immediately  and  sent  out  on  the  output  channel  of  the  process. 

When the logic program also contains rules, additional computation needs to be
performed. All solutions found by using assertions are immediately sent on the output
channel as before. For each rule that can be used to reduce the goal, unification
is attempted between the goal and the head of the rule. If unification fails, nothing
further needs to be done for this goal/rule pair. If unification succeeds, the substitution
used for the unification is used to create a subgoal. The subgoal is simply the
substitution applied to the tail of the rule. A matching pair of subtask/subsolution
channels is created for the process as shown in figure 11. The input substitution
that created the goal is kept in the process as state to be used later. The subtask
channel carries just one element, an empty substitution, to start the solution of the
subgoal. The subsolution channel brings back a set of solutions to the subgoal.

To make the solution of the subgoal possible, a two-terminal DAG of processes
is  set  up  between  the  subtask  and  subsolution  channels.  The  graph  is  called  two- 
terminal  because  it  has  two  special  nodes,  an  input  node  and  an  output  node.  The 
input  node  contains  arcs  to  all  nodes  without  any  other  inputs  and  the  output  node 
contains  arcs  from  all  nodes  that  do  not  have  any  other  outputs.  In  our  case,  the 
input  and  output  nodes  are  the  Head  and  Tail  nodes  respectively.  The  DAG  between 
the  Head  and  Tail  nodes  is  of  the  type  shown  in  figure  5  for  conjunctive  goals.  The 
DAG  corresponds  to  the  conjunctive  goal  that  is  obtained  by  instantiating  the  tail 
of  the  rule  with  the  unification  substitution.  An  example  of  such  a  two-terminal 
DAG  is  shown  in  figure  12.  Notice  that  variables  U  and  V  have  been  renamed 
to U101 and V102. In fact, all variables in the rule must be “standardized apart”
before unification [48]. When the subgoal graph is set up, a piece of state, called the
Invocation-Substitution, needs to be stored in the Tail process. This is the subset
of the substitution (resulting from unifying the goal with the rule) that contains
bindings of variables in the goal (i.e., bindings of variables in the rule are ignored).
Figure 12 shows the Invocation-Substitution, shown as IS in the figure, that needs
to  be  stored  for  the  example.  Notice  that  this  design  decision  leads  to  what  might 
be  called  distributed  binding  environments.  An  alternative  might  have  been  to  copy 
the  complete  environment  and  send  it  to  the  processes  associated  with  the  subgoal 
graph.  However,  the  problem  with  copying  is  that  the  environments  might  get  very 
large  and  the  messages  containing  them  may  have  excessive  communication  delays. 

The top level goal to the system is also represented like any other subgoal in
the system (i.e., it is a two-terminal DAG of processes). For the top level goal, the
Invocation-Substitution is empty. A top level goal is shown in figure 13. The annotation
next to the task channel of the Head process indicates that the set contains just
one element, an empty substitution.

The  purpose  of  the  Head  and  Tail  processes  needs  to  be  explained  now.  Both 
are  not  associated  with  any  literal. 

The  Head  process  merely  serves  as  a  router  of  data.  When  it  receives  an  empty 
substitution along its task channel, it sends copies of the same on all its output
channels.

Figure 12: Example of Goal Reduction
(Rule: a(W,x1,Y,Z) :- b(W,Y), c(Y,U), d(Y,V), e(U,V,Z); Goal: a(w1,X,Y,Z))

Figure 13: Top Level Goal

As mentioned above, the Tail process stores an Invocation-Substitution in its
state. The Tail process receives substitutions along its input channels and it computes
the cartesian product of the associated sets of substitutions like any other
normal process. The rest of its behavior is different from a normal process. For
each substitution on its virtual input channel, it sends a substitution on its solution
channel. The solution substitution is created by applying the Composition
function to the Invocation-Substitution and the input substitution. As an example,
consider figure 12 again. If the Tail process receives the input substitution
{Y=y1,U101=u1,V102=v1,Z=z1}, then the corresponding solution substitution is
{X=x1,Y=y1,U101=u1,V102=v1,Z=z1}.
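
The Tail behavior just described can be sketched in a few lines. The sketch assumes substitutions are dicts with ground bindings, so the second condition of Composition holds trivially and only the domains need checking; the names compose and tail_process are illustrative, and the virtual input channel is modelled as a plain list.

```python
def compose(s1, s2):
    """Composition for ground substitutions: union, defined when the domains are disjoint."""
    if set(s1) & set(s2):
        raise ValueError("the domains overlap")
    return {**s1, **s2}


def tail_process(invocation_substitution, virtual_input):
    """Send one solution substitution for each substitution on the virtual input channel."""
    return [compose(invocation_substitution, subst) for subst in virtual_input]


# The example of figure 12: the Invocation-Substitution binds the goal variable X.
invocation = {"X": "x1"}
received = [{"Y": "y1", "U101": "u1", "V102": "v1", "Z": "z1"}]
print(tail_process(invocation, received))
```

For a top level goal the Invocation-Substitution is empty, so the Tail process passes its input substitutions through unchanged.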

A normal process may have several subsolution channels, one for each of the
subgoals created. The input substitution used to create the goal is kept as state
in the process. When the process starts receiving substitutions along its subsolution
arcs, the following is done for each substitution: The Composition function is
applied to the associated input substitution and the subsolution substitution. The
resulting substitution is sent out on the virtual output channel. Subsolution substitutions
are processed in this manner as they arrive. If the order of arrival cannot
be determined (when they arrive too close to resolve the difference in times), then
they are processed in an indeterminate order. It is in this sense that we can say that
the output channel of the process is created from the indeterminate merge of the
solutions of its subgoals. As an example, consider figure 12 again. If a subsolution
substitution for the given goal is {X=x1,Y=y1,U101=u1,V102=v1,Z=z1}, then the
corresponding output substitution is {W=w1, X=x1, Y=y1, U101=u1, V102=v1,
Z=z1}.²
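
The handling of subsolutions can be sketched as below. The sketch assumes ground substitutions represented as dicts, identifies each subgoal by an arbitrary key, and models the indeterminate merge simply as whatever order the arrivals are listed in; the names normal_process_outputs and input_subst_of are hypothetical.

```python
def normal_process_outputs(arrivals, input_subst_of):
    """Compose each arriving subsolution with the input substitution stored for its subgoal.

    arrivals:       list of (subgoal_id, subsolution) pairs in their order of arrival.
    input_subst_of: state kept in the process, mapping subgoal_id -> input substitution.
    """
    outputs = []
    for subgoal_id, subsolution in arrivals:
        input_subst = input_subst_of[subgoal_id]
        # Composition is a plain union here because the domains are disjoint
        # and all bindings are ground.
        outputs.append({**input_subst, **subsolution})
    return outputs


# The example of figure 12: the goal a(w1,X,Y,Z) was created by the input substitution W = w1.
state = {"rule1": {"W": "w1"}}
subsolution = {"X": "x1", "Y": "y1", "U101": "u1", "V102": "v1", "Z": "z1"}
print(normal_process_outputs([("rule1", subsolution)], state))
```

Since messages are treated as sets in this section, any arrival order yields the same set of output substitutions; only their order on the output channel differs.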

The  graph  that  is  generated  in  the  process  of  goal  reductions  starting  from  the 
top-level  goal  is  the  dataflow*  graph  for  the  computation.  Note  that  this  graph  is 
not  present  before  run-time.  Also,  there  is  no  need  to  have  an  explicit  representation 
of  this  graph  at  run-time.  However,  algorithms,  presented  later  in  chapters  3  and  4, 

2 Clearly, the bindings for variables U101 and V102 are not necessary. If required, these could
have  been  pruned  either  by  the  Tail  process  or  the  Normal  process.  The  current  implementation 
leaves  these  bindings  in  because  they  provide  useful  information  during  program  development.  A 
production  system  should  prune  these  bindings  if  its  only  goal  is  efficiency. 




will  be  used  to  predict  certain  properties  of  these  graphs  for  the  purpose  of  resource 
allocation. 

2.3.3  Proof  of  Correctness 

Theorem  1  For  deductions  with  a  finite  and-or  tree,  the  set  of  solutions  produced 
by  PM  is  equal  to  the  set  of  solutions  produced  by  a  Prolog  interpreter. 

Notice  that  the  Prolog  interpreter  was  defined  in  section  1.2.  To  prove  the 
theorem,  we  will  prove  two  lemmas  first.  Before  we  get  to  the  lemmas,  a  few 
definitions  need  to  be  stated. 

For  a  directed  graph,  a  node  N1  is  defined  to  be  a  direct  predecessor  of  node 
N2  if  and  only  if  there  is  an  edge  from  N1  to  N2.  Similarly,  a  node  N1  is  defined 
to  be  an  ancestor  of  N2  if  and  only  if  N1  is  in  the  transitive  closure  of  the  direct 
predecessor  relation  of  N2.  If  a  directed  arc  goes  from  node  A  to  node  B,  A  is  called 
the source node and B is called the destination node of the arc. Note that “node”
and  “process”  are  used  interchangeably. 
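
The direct predecessor and ancestor relations can be computed as in the following sketch. The graph is assumed to be given as a list of (source, destination) arcs, and the function names are illustrative. The ancestor set is obtained by following direct predecessor links until no new node is found, which is the transitive closure mentioned above.

```python
def direct_predecessors(edges, node):
    """Nodes N1 such that there is an edge from N1 to the given node."""
    return {src for src, dst in edges if dst == node}


def ancestors(edges, node):
    """Transitive closure of the direct predecessor relation, starting from node."""
    found = set()
    pending = list(direct_predecessors(edges, node))
    while pending:
        current = pending.pop()
        if current not in found:
            found.add(current)
            pending.extend(direct_predecessors(edges, current))
    return found


# A diamond-shaped DAG: b feeds c and d, which both feed e.
arcs = [("b", "c"), ("b", "d"), ("c", "e"), ("d", "e")]
print(sorted(direct_predecessors(arcs, "e")))
print(sorted(ancestors(arcs, "e")))
```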

Lemma 1 For each input channel to a process P, if the set of substitutions contained
in the channel is equal to the set of solutions to the conjunctive goal CG1,
where CG1 is the set of the literal associated with the source process of the channel
and all literals associated with the ancestors of the source process, then the set of
substitutions in the virtual input channel of the process P is equal to the set of solutions
to the conjunctive goal CG2, where CG2 is the set of literals associated with
all the ancestors of the process P.

Proof: The statement “For each input channel to a process P, the set of substitutions
contained in the channel is equal to the set of solutions to the conjunctive
goal CG1, where CG1 is the set of the literal associated with the source process
of the channel and all literals associated with the ancestors of the source process”
in the first part of the lemma will be referred to as the correctness condition of
the lemma. Assume for now that the process in question has two input channels.
The proof can be easily extended to an arbitrary number of channels by induction
on the number of channels. Let the set of literals associated with the source process
of the first channel and all its ancestors be {C1, C2, ..., Ci, Ci+1, ..., Cm} and the
corresponding set for the second channel be {Ci, Ci+1, ..., Cm, Cm+1, ..., Cn}. Call
these two sets A and B respectively. Notice that the two sets have an arbitrary
set of literals, {Ci, Ci+1, ..., Cm}, in common. The set of ancestors of the process is
given by the union of A and B, {C1, C2, ..., Cn}. Call this set C. We know that the
solutions to C are exactly the same as the solutions of the bag, D, containing the
sum³ of A and B considered as bags. This is true because a conjunctive goal with
duplicate conjuncts is equivalent to a conjunctive goal with the duplicates removed.
Therefore, the lemma is reduced to the statement that applying CP to the sets
of solutions of A and B gives exactly the set of solutions to the conjunctive goal
composed of A and B. This simplified statement will be proved by showing a subset
relationship both ways.

First, let us prove that every solution of the conjunction of A and B is a member of
the result of CP. Let us pick an arbitrary solution S1. We know that any solution of a
set of conjuncts must be a solution of a subset also. (This follows easily from the
definition of the Prolog interpreter in section 1.2.) Therefore, S1 must be a solution
of A and it must be a solution of B. Of course, S1 may contain a superset of the
bindings required for A and B separately. In addition, the correctness condition of
the lemma tells us that this solution must be a member of both the input sets of
substitutions to the node. Actually, only the subset of S1 relevant to A will be in
the first channel. The same applies for B. If this is the case, then the definition of
CP requires that the union of the two substitutions along the two channels (i.e., S1)
be a member of the result of CP.

Now, let us show the reverse subset relationship to prove equality of the two sets. We
need to show that every member of the result of CP is a member of the solution set of
the conjunction of A and B. Recall from the definition of CP that each member of the
result of CP above will be the union of a substitution from the first channel and a
substitution from the second channel. In other words, each member of the result of CP
is a superset of a substitution on the first channel and also a superset of a
substitution on the second channel. Since the correctness condition of the lemma
states that each member of the first channel is a solution of A and each member of the
second channel is a solution of B, each member of the result of CP is a solution to A
as well as B. Therefore, it is a solution of the conjunction of A and B. (This follows
easily from the definition of the Prolog interpreter in section 1.2.) □

3Sum of bags is different from union of sets in the following way. The number of
instances of a member of the sum is equal to the sum of the number of instances of the
member in the bags whose sum is taken.

Lemma 2 Consider a two-terminal DAG of processes in which the input node is a Head
process, the output node is a Tail process, and the DAG in between is composed of
normal processes. For this graph, sending the Head process an empty substitution will
produce, at the virtual input of the Tail process, the set of solutions to the
conjunctive goal composed of the literals associated with the normal processes,
provided that each process individually solves the goals input to it correctly.

Proof: The statement “each process individually solves the goals input to it
correctly” in the last part of the lemma will be called the correctness condition of
this lemma.

We need to define the concept of the distance of a process from the Head process. Let
a distance of 1 denote that there is a direct edge from the Head process to the
process. A distance of n indicates that the maximum distance of a direct predecessor
of the process is n - 1. Let the distance of the Tail node also represent the length
of the graph. Notice that all such graphs have a finite length because they are DAGs.
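As an illustration only, the distance just defined can be read as a longest-path length from the Head node. The sketch below assumes that reading (a node with several predecessors takes one more than the largest predecessor distance); the graph and the node names are hypothetical, not taken from the thesis.

```python
from functools import lru_cache


def distances(preds, head="Head"):
    """Longest-path distance of every listed node from the Head node.

    preds maps each non-Head node to the list of its direct predecessors.
    """
    @lru_cache(maxsize=None)
    def dist(node):
        if node == head:
            return 0
        # One more than the maximum distance of any direct predecessor.
        return 1 + max(dist(p) for p in preds[node])

    return {node: dist(node) for node in preds}


# Hypothetical graph: Head -> a, Head -> b, a -> c, b -> c, c -> Tail
preds = {"a": ["Head"], "b": ["Head"], "c": ["a", "b"], "Tail": ["c"]}
d = distances(preds)
graph_length = d["Tail"]  # the distance of the Tail node is the graph length
```

Because the graph is acyclic, the recursion always terminates, which mirrors the remark that every such graph has finite length.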

Now, the lemma is trivially true for all such graphs in which the graph length is 2.
In this case, there is a set of normal nodes in parallel after the Head node and there
are edges from all these nodes to the Tail node. In this case, lemma 1 applies
directly.

Now, the induction hypothesis is that the lemma is true for graphs of lengths up to n.
The induction step requires that we prove that the lemma is also true for graphs of
length n + 1. For a graph of length n + 1, consider all nodes that are direct
predecessors of the Tail node. Replace one such node P by a new Tail node and consider
the graph between this new Tail node and the original Head node. The induction
hypothesis can be applied to this graph because its length is at most n. Therefore,
the set of substitutions in the virtual input channel of the new Tail process is equal
to the set of solutions of the associated conjunctive goal. This has an implication
for the original graph. The set of substitutions in the virtual input channel of the
node P (that is transformed to a Tail node) is equal to the set of solutions to the
conjunctive goal represented by the set of nodes that are ancestors of the node P.
Now, the virtual input channel of the node represents goals for the node. The
correctness condition of the lemma states that all such goals are correctly solved.
Therefore, the output channel of the node P (which is also an input channel of the
original Tail process) will contain the set of solutions to the conjunctive goal of
the literals represented by the node P and all its ancestors. The same can be claimed
for all input channels of the Tail node. Now, we can apply lemma 1 to prove that the
set of substitutions in the virtual input channel of the Tail node is equal to the set
of solutions of the conjunction of all ancestors of the Tail node (i.e., all literals
in the original graph). □

Proof  of  Theorem  1:  The  dataflow*  graph  contains  some  normal  processes 
whose  solutions  are  produced  by  unification  with  facts  directly  and  not  by  reduction 
to  a  DAG  of  processes  obtained  by  applying  a  rule.  If  there  were  no  such  normal 
processes,  the  associated  and-or  tree  would  be  infinite  and  the  computation  would 
never  end.  Let  us  refer  to  these  nodes  as  nodes  of  level  1.  In  general,  a  node  is 
defined  to  be  of  level  n  +  1  if  and  only  if  the  maximum  level  of  any  node  in  any  of 
its  subgoal  graphs  is  n.  The  maximum  node  level  in  the  dataflow*  graph  is  called 
the  level  of  the  graph. 

The theorem will be proved by induction on the level of dataflow* graphs. The theorem
is trivially true for graphs of level 1 because lemma 2 applies directly. The
induction hypothesis is that it is true for graphs of level up to n. We need to show
that it is true for graphs of level n + 1. At the top level in this dataflow* graph is
a two-terminal graph with some nodes of level n + 1. For each such node, its subgoals
contain nodes of level n or less. Therefore, each of its subgoals is correctly solved
according to the induction hypothesis. Since the solution of the node is obtained
simply by taking the indeterminate merge of the solutions of the subgoals, the node
itself is correctly solved. (Indeterminate merge produces the same solutions as the
ones produced by backtracking using the Prolog interpreter defined in section 1.2.)
Now, the application of lemma 2 proves that the top level DAG is also correctly
solved. □


2.3.4  Algorithmic  Details 

The description of the basic execution model that was presented in section 2.3.2 was
complete in its own right. However, modification of certain peripheral details makes
the implementation easier or more efficient. In addition, that description is much too
abstract for a direct implementation. In this section, we describe the additional
features that are added to the abstract description and then describe specific choices
made in terms of state, messages, and procedures.

2.3.4.1  Additional  features 

There  are  three  additional  features.  First,  messages  along  channels  are  treated  as 
streams  as  opposed  to  sets.  Second,  messages  contain  more  than  just  substitutions. 
Third,  each  stream  of  messages  is  terminated  by  a  special  end-of-stream  message. 

Sets  to  Streams  This  is  the  most  important  additional  feature.  It  is  due  to 
this  feature  that  PM  gets  its  dataflow  flavor.  Every  channel  contains  a  stream  of 
messages.  A  stream  is  equivalent  to  an  ordered  set.  In  general,  channels  are  not 
required  to  preserve  the  ordering  of  messages  from  their  inputs  to  their  outputs. 
Therefore,  two  messages  that  are  sent  in  one  order  from  the  source  process  of  a 
channel  may  arrive  in  another  order  at  the  destination  process  of  the  channel. 
Typically,  messages  do  arrive  in  order.  The  advantage  of  not  requiring  in-order 
delivery  is  that  message  protocols  can  generally  be  simpler  and  faster. 

Computation  at  processes  is  triggered  only  by  the  arrival  of  messages  and  by 
no  other  mechanism.  In  particular,  complete  streams  need  not  arrive  for  processing 
to  begin.  In  this  sense,  a  process  behaves  exactly  like  a  node  in  a  dataflow  graph. 
However,  as  noted  before,  processes  contain  state  whereas  dataflow  nodes  do  not. 

In  general,  when  an  input  message  is  processed,  several  output  messages  may 
be  generated  as  described  in  the  abstract  behavior.  These  output  messages  are  sent 
out  on  the  appropriate  output  channels  before  the  next  input  message  is  processed. 

The  only  place  in  the  description  where  the  order  of  input  messages  needs  to  be 
clarified  is  where  the  function  CP  is  applied  to  the  sets  of  substitutions  on  the  input 
channels  to  produce  a  set  of  substitutions  on  the  virtual  input  set.  In  particular,  a 
new  function  CPnew  needs  to  be  defined  that  takes  n  input  streams  of  substitutions 
and  returns  one  stream  of  substitutions  to  be  considered  the  virtual  input  stream. 
Streams  are  represented  mathematically  as  tuples.  CPnew  is  defined  to  be  the 
composition  of  three  other  functions. 

$$\mathit{CPnew} = \mathit{CPnew3} \circ \mathit{CPnew2} \circ \mathit{CPnew1}$$

CPnew1 takes as input n streams of substitutions and returns one stream of n-tuples
(of substitutions). Let the input streams be:

$$\langle S_{1,1}, S_{1,2}, \ldots, S_{1,l_1} \rangle$$
$$\langle S_{2,1}, S_{2,2}, \ldots, S_{2,l_2} \rangle$$
$$\vdots$$
$$\langle S_{n,1}, S_{n,2}, \ldots, S_{n,l_n} \rangle$$

The lengths of the streams are $l_1, l_2, \ldots, l_n$ as shown. The virtual input
stream is specified by the elements that it contains and a total order. The elements
that it contains are all $l_1 \times l_2 \times \ldots \times l_n$ n-tuples of the
form:

$$\langle S_{1,i_1}, S_{2,i_2}, \ldots, S_{n,i_n} \rangle$$

where $1 \le i_j \le l_j$. As can be seen from the prototypical tuple, its kth element
comes from the kth stream for all k such that $1 \le k \le n$.

The order of the elements in the output stream is constrained only by a partial order
to be described shortly. Since a total order is required for a stream, any particular
total order that satisfies the partial order is acceptable. Two prototypical elements
PE1 and PE2 are ordered if n - 1 of their constituent elements are the same and the
nth is different. For example,

$$\mathit{PE1} = \langle S_{1,i_1}, \ldots, S_{j,i_{j1}}, \ldots, S_{n,i_n} \rangle$$
$$\mathit{PE2} = \langle S_{1,i_1}, \ldots, S_{j,i_{j2}}, \ldots, S_{n,i_n} \rangle$$

where the $S_{i,j}$ substitutions are as given above. In this case, PE1 will precede
PE2 in the output stream if and only if $i_{j1} < i_{j2}$. Similarly, PE2 will precede
PE1 in the output stream if and only if $i_{j2} < i_{j1}$.

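As a specific example, consider the case when there are two input streams. Let one
input stream be represented by the tuple $\langle S_1, S_2 \rangle$ and the other
stream by $\langle S_3, S_4 \rangle$. There would be four elements in the output
stream: $\langle S_1, S_3 \rangle$, $\langle S_2, S_3 \rangle$,
$\langle S_1, S_4 \rangle$, and $\langle S_2, S_4 \rangle$. Let $\prec$ indicate the
ordering predicate. The ordering constraint described above would force the following
partial order:

$$\langle S_1, S_3 \rangle \prec \langle S_2, S_3 \rangle$$
$$\langle S_1, S_3 \rangle \prec \langle S_1, S_4 \rangle$$
$$\langle S_2, S_3 \rangle \prec \langle S_2, S_4 \rangle$$
$$\langle S_1, S_4 \rangle \prec \langle S_2, S_4 \rangle$$

Therefore, the output stream is one of either

$$\langle \langle S_1, S_3 \rangle, \langle S_2, S_3 \rangle, \langle S_1, S_4 \rangle, \langle S_2, S_4 \rangle \rangle$$

or

$$\langle \langle S_1, S_3 \rangle, \langle S_1, S_4 \rangle, \langle S_2, S_3 \rangle, \langle S_2, S_4 \rangle \rangle$$

The motivation for this ordering constraint is that it is similar in spirit to the
first-in-first-out and incremental processing that is used for a straight dataflow
solution. This completes the definition of CPnew1.4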
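The following is a minimal sketch of CPnew1, not the implementation used in PM. It assumes streams are modelled as Python lists and relies on the fact that lexicographic index order is one of the total orders compatible with the partial order above. The helper `respects_partial_order` is a hypothetical checker added only for illustration; it works on tuples of stream indices.

```python
from itertools import product


def cpnew1(*streams):
    """Return every n-tuple with one element per stream, in one admissible order.

    itertools.product yields tuples in lexicographic index order. Two tuples that
    differ in exactly one position therefore appear in the order of that index.
    """
    return list(product(*streams))


def respects_partial_order(index_tuples):
    """Check that tuples differing in exactly one index appear in index order."""
    position = {t: k for k, t in enumerate(index_tuples)}
    for a in index_tuples:
        for b in index_tuples:
            diff = [j for j in range(len(a)) if a[j] != b[j]]
            if len(diff) == 1 and a[diff[0]] < b[diff[0]] and position[a] > position[b]:
                return False
    return True


out = cpnew1(["S1", "S2"], ["S3", "S4"])
```

For the two-stream example, `out` is the second of the two admissible output streams listed in the text.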

4Notice that, strictly speaking, one would have to remove any duplicates in the output
of CPnew1 if one is to think of it as a set (or an ordered set). Typically,
implementations of logic programming languages do not prune out duplicates in the
interest of efficiency. In the same spirit, the implementation of PM does not remove
duplicates either. Therefore, if one is to be mathematically correct, the collections
of substitutions along streams should be called bags (or ordered bags).

CPnew2 is applied to the output of CPnew1. CPnew2 takes one stream as its input and
returns one stream as its output. Each element in the input is an n-tuple of
substitutions. The output of CPnew2 is a stream with exactly the same number of
elements as the input stream. The elements of the output stream are obtained by
applying the Merge function (used in the description of CP) to the corresponding
elements of the input stream (i.e., elements in the same positions). Note that Merge
takes n input substitutions and returns a substitution or a special element $\bot$.
The n input substitutions in this case are the n constituent elements of each element
of the input stream to CPnew2. The Merge function is used, as before, for the purpose
of filtering out bad combinations of substitutions. As an example, if the input to
CPnew2 were

$$\langle \langle \{X = x1, Y = y1\}, \{X = x2, Z = z1\} \rangle, \langle \{X = x2, Y = y2\}, \{X = x2, Z = z1\} \rangle,$$
$$\langle \{X = x1, Y = y1\}, \{X = x3, Z = z2\} \rangle, \langle \{X = x3, Y = y1\}, \{X = x3, Z = z2\} \rangle \rangle$$

then the output would be

$$\langle \bot, \{X = x2, Y = y2, Z = z1\}, \bot, \{X = x3, Y = y1, Z = z2\} \rangle$$

The output of CPnew2 is the input to CPnew3. CPnew3 takes one stream as its input and
returns one stream of substitutions as its output. The output of CPnew3 is exactly the
same as its input except that all the $\bot$ elements are filtered out. All the
non-$\bot$ elements in the input stream are retained in the output stream with the
same order. As an example, if the input to CPnew3 were

$$\langle \bot, \{X = x2, Y = y2, Z = z1\}, \bot, \{X = x3, Y = y1, Z = z2\} \rangle$$

then the output would be

$$\langle \{X = x2, Y = y2, Z = z1\}, \{X = x3, Y = y1, Z = z2\} \rangle$$
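The two filtering stages can be sketched as follows. This is an illustrative approximation under stated assumptions: substitutions are modelled as Python dictionaries that bind variables to constants, so two substitutions conflict exactly when they bind the same variable to different values. The real Merge function is defined with the description of CP and may need full unification. `BOTTOM` is a hypothetical stand-in for the special bottom element.

```python
BOTTOM = None  # stand-in for the special "bottom" element returned by Merge


def merge(*substitutions):
    """Union of the substitutions, or BOTTOM if any variable is bound twice
    to different values (assumes all bindings are to constants)."""
    result = {}
    for subst in substitutions:
        for var, value in subst.items():
            if result.setdefault(var, value) != value:
                return BOTTOM
    return result


def cpnew2(tuple_stream):
    """Apply Merge position by position; the output has the same length."""
    return [merge(*tup) for tup in tuple_stream]


def cpnew3(stream):
    """Drop the BOTTOM elements and keep everything else in order."""
    return [s for s in stream if s is not BOTTOM]


# The example input to CPnew2 from the text.
tuples = [
    ({"X": "x1", "Y": "y1"}, {"X": "x2", "Z": "z1"}),
    ({"X": "x2", "Y": "y2"}, {"X": "x2", "Z": "z1"}),
    ({"X": "x1", "Y": "y1"}, {"X": "x3", "Z": "z2"}),
    ({"X": "x3", "Y": "y1"}, {"X": "x3", "Z": "z2"}),
]
merged = cpnew2(tuples)
filtered = cpnew3(merged)
```

Testing with `is not BOTTOM` rather than truthiness keeps an empty substitution, which is a legitimate result, from being filtered out by mistake.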

The output of CPnew3 is the output of the top-level function CPnew. This completes the
definition of CPnew. The resulting stream obtained by an application of CPnew to some
stream arguments will be called the cartesian product of the streams.

Message Content The message content is designed so that no special messages are
required to create processes initially. There is enough information in the messages
that a process can be created when the first message for the process arrives.5 To make
this possible, each message contains more than just a substitution. In fact, each
message contains a task. A task includes a substitution as well as a two-terminal DAG
of literals representing a conjunctive goal. The two-terminal DAG for a task along an
input or output channel is the subgraph that can be reached from the channel up to and
including the first tail node. For example, consider figure 12 again. Tasks along the
channel from the Head node to the “b(W,Y)” node would include the nodes labeled
“b(W,Y)”, “c(Y,U101)”, “d(Y,V102)”, “e(U101,V102,Z)” and the Tail node. Similarly,
tasks along the channel from the node labeled “b(W,Y)” to the node labeled “c(Y,U101)”
would include the nodes labeled “c(Y,U101)”, “e(U101,V102,Z)” and the Tail node. The
two-terminal DAG for tasks on task/subtask channels is the graph for the entire
conjunctive subgoal (including the Head and Tail nodes). The two-terminal DAG for
tasks on the solution/subsolution channels is empty.
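The subgraph carried by a task can be sketched as a reachability computation. The edge list below is an assumption reconstructed from the two examples in the text (figure 12 itself is not reproduced here), and `task_dag` is a hypothetical helper name.

```python
def task_dag(successors, channel_destination):
    """Nodes reachable from a channel's destination, up to and including Tail."""
    seen = set()
    stack = [channel_destination]
    while stack:
        node = stack.pop()
        if node in seen:
            continue
        seen.add(node)
        if node != "Tail":  # stop at the first Tail node
            stack.extend(successors.get(node, []))
    return seen


# Assumed edges for the conjunct graph of figure 12.
successors = {
    "Head": ["b(W,Y)"],
    "b(W,Y)": ["c(Y,U101)", "d(Y,V102)"],
    "c(Y,U101)": ["e(U101,V102,Z)"],
    "d(Y,V102)": ["e(U101,V102,Z)"],
    "e(U101,V102,Z)": ["Tail"],
}
from_head = task_dag(successors, "b(W,Y)")
from_b_to_c = task_dag(successors, "c(Y,U101)")
```

Under these assumed edges, the two results match the node sets listed in the text for the Head-to-b and b-to-c channels.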


End-of-Stream Message Another feature that is added in the detailed description is
end-of-stream messages. These are special messages that are sent on streams after the
last regular message has been sent. The advantage of this feature is that the top
level process can tell when it has produced the last answer. This is the only place in
the description of PM where temporal ordering of messages on streams is necessary.
There are many ways of doing this with much less overhead than the case in which all
messages on a stream are required to be temporally ordered.6
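One low-overhead scheme is for the end-of-stream message to carry the number of regular messages sent, with the receiver deferring it until that many have arrived. The sketch below illustrates that idea only; the class and method names are hypothetical, and it assumes the channel may reorder messages but never loses or duplicates them.

```python
class InputStream:
    """Receiver side of one stream on a channel that may reorder messages."""

    def __init__(self):
        self.received = 0     # regular messages seen so far
        self.expected = None  # count carried by the end-of-stream message
        self.closed = False

    def on_regular(self, message):
        self.received += 1
        self._maybe_close()

    def on_end_of_stream(self, sent_count):
        self.expected = sent_count
        self._maybe_close()

    def _maybe_close(self):
        # The end-of-stream takes effect only once every regular message is in.
        if self.expected is not None and self.received == self.expected:
            self.closed = True


stream = InputStream()
stream.on_regular("m1")
stream.on_end_of_stream(2)        # overtakes the second regular message
closed_too_early = stream.closed  # still False: one message is outstanding
stream.on_regular("m2")
```

Only the end-of-stream message needs ordering relative to the others, so regular messages can still be delivered in any order.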

The  rest  of  this  section  contains  detailed  descriptions  of  all  the  state,  messages, 
and  procedures  required  for  the  basic  execution  model. 


5It  may  still  be  the  case  that  additional  messages  to  set  up  later  processes  concurrently  with 
processing  of  the  earlier  processes  in  the  DAG  may  be  more  efficient. 

6For example, the end-of-stream message may include the number of messages that have
been sent on the stream so far. The destination node must also keep a counter of
messages received. When an end-of-stream message is received, its processing is
postponed until the right number of regular messages has been received.

2.3.4.2 State

Each processor, process and task has a system-wide unique name.7 In the rest of this
thesis, typical names for processors, processes, and tasks will be of the form Pj,
PSj, and Tj respectively.
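A simple way to obtain system-wide unique names is to combine a unique processor name with a counter local to that processor. The sketch below is illustrative only; the `NameGenerator` class and the exact name format are assumptions, not the naming scheme of PM.

```python
from itertools import count


class NameGenerator:
    """Per-processor generator of system-wide unique names."""

    def __init__(self, processor_name):
        self.processor_name = processor_name
        self._counter = count(1)  # unique within this processor only

    def fresh(self, kind):
        # Prefixing with the processor name makes the local number global.
        return f"{kind}-{self.processor_name}-{next(self._counter)}"


gen_p1 = NameGenerator("P1")
gen_p2 = NameGenerator("P2")
names = [gen_p1.fresh("T"), gen_p1.fresh("PS"), gen_p2.fresh("T")]
```

No communication between processors is needed to generate a name, which is the point of the scheme.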

Each processor maintains the following state information on the tasks and processes
for which it is responsible:

Work-Set:  This  is  a  set  of  tasks  that  the  processor  may  work  on. 

Task: Each task is a 5-tuple of the form:

<Task-Name, Task-Description, Subtasks, Spawning-Process-Name, Parent-Task-Name>

Task-Name is the system-wide unique name of the task. Task-Description is the
description of the task. This field contains the substitution that was described
earlier as the sole content of a message. This field will be described in more detail
below. The cartesian product of input streams of tasks produces a single virtual input
stream of tasks. The task description field of each task in the virtual input stream
gets its substitution exactly as described in the behavioral description. For each
task that is generated by the cartesian product function, the Spawning-Process-Name
field is set to the name of the process that applied the cartesian product. This field
is empty for any other tasks. Again, for every task in the virtual input stream,
multiple subgoals may be created by the application of rules that can reduce the goal
represented by the task. The reduced goals are represented as tasks and the name of
each such reduced task is a member of the Subtasks field of the parent task.
Similarly, child tasks (i.e., tasks that are produced by the goal reduction) have
their Parent-Task-Name field set to the name of the parent task. Variables in the rule
must be “standardized apart” before unification with the goal literal.

Task-Description:  Each  is  a  tuple  of  the  form:  <CG,  BL> 

CG (or Conjunct Graph) is a two-terminal DAG. BL is a substitution. The nodes in the
graph are processes (as specified below).

7All that is needed for this to work is that each processor have a unique name and
each processor have a processor-wide unique name generator. Unique system-wide names
for processes and tasks can now be generated by combining the system-wide unique
processor name, where the process or task is to be generated, with a processor-wide
unique name.

Process: Each is a 10-tuple of the form:

<Process-Name, Literal, Processor-Name, Number-Inputs, Input-Queues, Outputs,
Spawned-Task-Names, Type, Child-Task-Name, Invocation-Substitution>

Process-Name is the system-wide unique name of the process. Literal is the literal
that the process is responsible for solving. Processor-Name is the name of the
processor where the process resides. Number-Inputs is the number of inputs to the
process. Input-Queues are the queues of messages waiting to be processed at the inputs
to the process. These queues contain additional state to (1) indicate whether the
end-of-stream message has been received and (2) give the status of the cartesian
product formation from the inputs (more later on this). Outputs are a set of tuples
specifying the inputs of other processes. Each tuple is of the form <process-name,
processor-name, input-number>. Spawned-Task-Names is a set of task names. The names
correspond to tasks that are created by cartesian product. In case no cartesian
product is necessary (when there is only one input), the unmodified task names from
the inputs are directly included in Spawned-Task-Names. The Type of the process can be
one of {Normal, Head, Tail}. The Invocation-Substitution has been described before.

A complete process specification as given above is not necessary for each node in the
conjunct graph of a task specification. A partial specification as given below is
sufficient. “xxx” indicates an unspecified field.

<Process-Name, Literal, Processor-Name, Number-Inputs, xxx, Outputs, xxx, Type, xxx,
xxx>.

The unspecified fields are Input-Queues, Spawned-Task-Names, Child-Task-Name and
Invocation-Substitution from left to right.
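The task and process tuples could be represented as records along the following lines. This is a sketch only: the Python field types are assumptions, the conjunct graph is simplified to a flat list of nodes, and the fields that may be left unspecified (“xxx”) are moved to the end and default to `None`, so the field order differs from the tuples in the text.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Process:
    process_name: str
    literal: str
    processor_name: str
    number_inputs: int
    outputs: list   # tuples <process-name, processor-name, input-number>
    type: str       # "Normal", "Head", or "Tail"
    # The four fields below are the ones left as "xxx" in a partial specification.
    input_queues: Optional[list] = None
    spawned_task_names: Optional[set] = None
    child_task_name: Optional[str] = None
    invocation_substitution: Optional[dict] = None


@dataclass
class TaskDescription:
    cg: list  # conjunct graph, simplified here to a list of Process nodes
    bl: dict  # substitution


@dataclass
class Task:
    task_name: str
    task_description: TaskDescription
    subtasks: set = field(default_factory=set)
    spawning_process_name: Optional[str] = None
    parent_task_name: Optional[str] = None


# A partial process specification, as carried inside a task's conjunct graph.
node = Process("PS1", "b(W,Y)", "P2", 1, [("PS2", "P3", 1)], "Normal")
task = Task("T1", TaskDescription(cg=[node], bl={}))
```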

Notice that the Processor-Name field is included. In particular, this means that a
process that creates a subgoal/subtask must bind all processes in the conjunct graph
of the subtask to specific processors. In case multiple choices exist (when certain
subsets of the database are replicated), resource allocation procedures must be
invoked to make the choice.

2.3.4.3 Messages

Messages are 4-tuples of the form:

<Message-Type,  Source-Processor-Name,  Destination-Processor-Name,  Arguments> 

For now, only one message type is required. More types are required for the extensions
to the basic execution model. The type needed now is Input-Task. For this, the
Arguments field is a tuple of the form:

<Destination-Process-Name,  Destination-Input-Number,  Task-Name,  Task-Description> 

The  fields  have  self-explanatory  names.  End-of-stream  is  indicated  with  “EOS” 
as  the  substitution  in  the  Task-Description. 

2.3.4.4  Procedures 

As mentioned before, the database of rules/assertions is distributed before any goal
is ever presented to the system. Also, all rules/assertions that can be used to reduce
any particular task are in a single processor. It turns out that each processor in the
system need not have the complete partitioning information at run-time. Even the
partitioning information may be distributed. A processor needs to know only the
identity of processors that can be used to solve each literal in the tails of the
rules that it contains (i.e., each literal in the conjunctive subgoals that it
generates itself). The processor that is given the top level goal must know the
identity of all processors relevant to solving each literal in every goal that may be
presented to the system.

When a process on a processor creates a subtask, the Head and Tail processes
associated with the subtask are created at the same processor. The message containing
an empty substitution to the Head process can be replaced by a function call.
Similarly, the output messages from a Tail process on its solution stream may be
replaced by function calls since the destination of the messages is the same
processor. Therefore, messages are needed along input and output channels only.
Messages along other types of channels can be replaced by function calls.

As mentioned before, every task on an input/output channel contains the two-terminal
DAG that can be reached from the channel. Therefore, DAGs in tasks input to a process
must have their input node “stripped off” to obtain the DAGs that must be output from
the process. The cost of this procedure is simply the cost of traversing the
two-terminal subgraphs that can be reached from the outputs of the process.

The  cartesian  product  function  was  described  earlier.  One  interesting  feature  of 
this  function  is  that  it  can  be  computed  “incrementally”.  As  messages  arrive  on  the 
input  channels  of  a  process,  they  are  kept  in  a  FIFO  queue.  Consider  the  situation 
when  there  are  some  messages  in  the  queues  and  a  message  arrives  on  one  of  the 
channels.  The  virtual  tasks  that  can  be  created  out  of  the  combination  of  this  task 
with  the  tasks  waiting  in  other  queues  may  be  immediately  computed.  Of  course, 
the  order  of  these  newly  generated  virtual  input  tasks  on  the  virtual  input  stream 
must  satisfy  the  order  prescribed  by  the  cartesian  product  function.  Clearly,  if  the 
cartesian  product  function  is  going  to  be  computed  incrementally,  then  some  state 
needs  to  be  kept  to  indicate  the  extent  to  which  the  cartesian  product  has  been 
computed  at  any  given  time. 
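Incremental computation of the cartesian product can be sketched as below. The sketch covers only the tuple-forming stage (CPnew1); Merge and the filtering of bad combinations would be applied to each returned tuple. It assumes that messages on each channel arrive in the order they were sent, and the class and method names are hypothetical. The state recording how far the product has been computed is implicit here: everything already in the queues has been combined.

```python
from itertools import product


class IncrementalCP:
    """Forms cartesian-product tuples as messages arrive on the input channels."""

    def __init__(self, number_inputs):
        self.queues = [[] for _ in range(number_inputs)]  # one FIFO per input

    def arrive(self, channel, item):
        """Queue item on a channel and return the tuples it newly enables."""
        # Combine the new item with everything waiting in the other queues.
        pools = [[item] if k == channel else queue
                 for k, queue in enumerate(self.queues)]
        new_tuples = list(product(*pools))
        self.queues[channel].append(item)
        return new_tuples


cp = IncrementalCP(2)
virtual_input = []
for channel, item in [(0, "S1"), (1, "S3"), (0, "S2"), (1, "S4")]:
    virtual_input.extend(cp.arrive(channel, item))
```

With this arrival order, the virtual input stream comes out as the first of the two admissible orders given in the two-stream example earlier.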

As  mentioned  before,  the  last  message  on  each  stream  is  a  special  end-of-stream 
message.  Special  care  must  be  taken  to  send  these  messages  only  when  all  other 
messages  have  been  sent  on  a  stream.  In  particular,  a  normal  process  will  send  end- 
of-stream  messages  on  its  output  channels  (one  on  each)  when  the  conjunction  of  the 
following  three  conditions  is  satisfied:  (1)  All  input  channels  have  received  an  end- 
of-stream  message.  (2)  All  tasks  on  the  virtual  input  stream  have  had  their  subtasks 
created.  (3)  All  subsolution  channels  have  received  an  end-of-stream  message.  For  a 
Tail  process,  since  there  are  no  subsolution  channels,  condition  (2)  may  be  left  out. 
In  addition,  the  end-of-stream  message  is  not  sent  on  any  output  channel  (since  the 
Tail  process  does  not  have  one)  but  it  is  sent  on  the  solution  channel.  In  the  case  of 
Head  processes,  only  one  input  message  is  received  on  the  task  channel.  Therefore, 
the  end-of-stream  need  not  be  sent  explicitly.  For  messages  output  from  a  Head 
process,  each  channel  carries  two  messages  exactly — an  input-task  message  and  an 
end-of-stream  message. 

Figure  14:  An  Example  Database 

2.3.5  A  Complete  Example 

As mentioned before, a dataflow* graph is the graph of process nodes that is generated
during the execution of PM. However, just as a syntactic and-or tree is easier to view
than a complete and-or tree, a syntactic version of the dataflow* graph is easier to
comprehend. In the syntactic version, a process is connected (by subtask and
subsolution channels) to a single copy of the subgoal graph for each rule/assertion
that applies to the literal associated with the process.

Consider  the  example  database  shown  in  figure  14.  The  distribution  of  the 
database  is  also  indicated  in  the  figure.  In  the  example,  the  database  is  partitioned 
on  the  basis  of  predicate  symbols  and  each  subset  is  resident  on  a  single  processor. 

A  graphical  abbreviation  is  used  to  reduce  the  complexity  of  the  dataflow*  graph 
of  the  example.  This  abbreviation  is  shown  in  figure  15. 

The syntactic dataflow* graph associated with the database for the query r(X,Y) is
shown in figure 16. Solid boxes indicate processes. The literals inside the boxes are
the literals to be solved by the processes. The exceptions are the Head and Tail
process pairs which are shown as boxes with “H/T” inside. Dashed lines around sets of
boxes indicate that those processes reside in the same processor. The name of the
processor is indicated as a name of the form Pi. Arcs that cross dashed lines indicate
streams of input-task messages. Task names (of the form Ti) are written next to the
arcs.  indicates temporal sequencing. Arcs inside dashed lines indicate function calls
within the same processor to set up child tasks (downward arcs) and to send solutions
to parent tasks (upward arcs).

Figure 15: Graphical Abbreviation for Dataflow* Graphs

The  top  level  task  is  T1  at  processor  P1.  It  turns  out  that  it  has  only  one  literal 
“r(X,Y)”  to  solve.  In  general,  there  could  be  an  arbitrary  number. 

Notice the two different kinds of dataflow* subgraphs for child tasks. The conjunctive
goal “p(X),q(Y),s(X,Y)” leads to the conjunct graph with “p(X)” and “q(Y)” solved
in parallel followed by “s(X,Y)”. In the case of the conjunctive goal “m(X),n(X,Y)”,
the two literals must be solved sequentially because they share the variable “X”.
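
The ordering rule in the two goals above can be sketched in a few lines. This is a deliberately simplified illustration, not Conery's algorithm and not the algorithm of appendix A: it assumes that a literal must wait for every earlier literal with which it shares a variable, and the helper names `variables` and `partial_order` are hypothetical.

```python
def variables(literal: str) -> set:
    """Return the variables of a literal such as 's(X,Y)'.
    Assumption: Prolog convention, variables start with an uppercase letter."""
    args = literal[literal.index("(") + 1 : literal.rindex(")")]
    return {a.strip() for a in args.split(",") if a.strip()[:1].isupper()}


def partial_order(conjuncts: list) -> dict:
    """Map each literal to the earlier literals it must wait for.
    Simplified rule: wait for every earlier literal that shares a variable."""
    deps = {}
    for i, lit in enumerate(conjuncts):
        deps[lit] = [prev for prev in conjuncts[:i]
                     if variables(prev) & variables(lit)]
    return deps


# p(X) and q(Y) have no predecessors, so they may run in parallel;
# s(X,Y) waits for both.
first = partial_order(["p(X)", "q(Y)", "s(X,Y)"])

# n(X,Y) shares X with m(X), so the two literals are sequential.
second = partial_order(["m(X)", "n(X,Y)"])
```

Literals with an empty dependency list correspond to nodes that can start immediately; the remaining entries give the arcs of the conjunct graph under this simplified rule.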

Finally, figure 17 shows some abbreviated task descriptions. To avoid cluttering
up the figure, task tuples have been abbreviated to the shortened tuples

< Task-Name, BL, Parent-Task-Name, CG >

where BL is the associated substitution and CG is the conjunct graph.

The sample task shown on top contains mnemonic field names to make it easier
to decode the fields in the examples. In addition, the process nodes in the CGs (or
Conjunct Graphs) are abbreviated to just the associated literals. The
Invocation-Substitution for Tail nodes is shown directly below the boxes representing them.

T1 is the task representing the top-level goal. T2 and T3 are two solutions
for the top-level task. T4 is the end-of-stream message for the solution stream
associated with T1. In fact, the last task in each stream (except streams going to
Head processes) is a similar end-of-stream message.

A child task such as T33 gets a uniquely renamed variable, X101, because variables
in rules are “standardized apart” before unification with goals.

Each  process  is  responsible  for  solving  the  input  node  in  the  conjunct  graph  of 
each  task  on  its  input  streams.  The  outgoing  tasks  are,  therefore,  the  incoming 
tasks  with  the  input  node  “peeled  off”.  For  example,  look  at  task  T47  and  task 
T49.  In  general,  “peeling  off”  the  input  node  of  a  task  can  create  multiple  tasks. 


Figure 16: Dataflow* Graph for Example

Figure 17: Some Abbreviated Task Descriptions

< TN, BL, PTN, CG >   Sample Task

< T1, {}, nil, ... >
< T2, {X=a, X101=b, Y=b}, nil, nil >
< T3, {X=b, X101=b, Y=a}, nil, nil >
< T4, EOS, nil, nil >
< T16, {}, ... >
< T33, {}, ... >
< T47, {}, ... >
< T49, {X101=a}, ... >
< T50, {X101=b}, ... >
< T52, {X101=b, Y=a}, ... >
< T62, {Y=a}, T50, nil >

The Composition function is applied to pairs of (1) substitutions received by
a process on its input tasks with (2) substitutions received from the solution of
the child tasks. The composed substitutions are sent out with the outgoing tasks.
For example, observe how the result of Composition, from its application to the
substitution in input task T50 and the substitution in T62, is the substitution in
output task T52.
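
The T50/T62/T52 example can be traced with a small sketch. The Composition function itself is defined elsewhere in this chapter; the version below is only an approximation that assumes ground bindings, treats composition as a consistent union, and uses the hypothetical name `compose`.

```python
def compose(input_subst: dict, solution_subst: dict):
    """Combine an input-task substitution with a child-solution substitution.
    Assumption: all bindings are ground, so composition reduces to a union
    that fails (returns None) if the two sides disagree on a variable."""
    combined = dict(input_subst)
    for var, value in solution_subst.items():
        if var in combined and combined[var] != value:
            return None  # inconsistent bindings: no outgoing task
        combined[var] = value
    return combined


t50 = {"X101": "b"}      # substitution carried by input task T50
t62 = {"Y": "a"}         # substitution reported by solution task T62
t52 = compose(t50, t62)  # substitution sent out with output task T52
```

Under these assumptions `t52` holds the bindings X101=b and Y=a, matching the substitution listed for T52 in figure 17.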

Cartesian  product  of  multiple  input  streams  at  a  process  creates  new  tasks  with 
new  names.  T37,  T38,  T39,  and  T40  are  skipped  in  the  task  numbering  shown 
in  figure  16  because  they  are  created  internally  in  processor  P6  from  the  cartesian 
product  of  T18,  T19  and  T21,  T22. 
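
The naming of T37 through T40 can be illustrated as follows. Only the task names come from the example; the substitutions attached to T18, T19, T21, and T22 are invented for illustration, and `merge` and `cartesian_tasks` are hypothetical helpers that assume ground bindings.

```python
from itertools import count, product


def merge(a: dict, b: dict):
    """Union of two ground substitutions, or None if they disagree."""
    if any(var in a and a[var] != val for var, val in b.items()):
        return None
    return {**a, **b}


def cartesian_tasks(stream_a, stream_b, names):
    """Pair every task of stream_a with every task of stream_b and give
    each consistent combination a fresh task name."""
    tasks = []
    for (_, sub_a), (_, sub_b) in product(stream_a, stream_b):
        combined = merge(sub_a, sub_b)
        if combined is not None:
            tasks.append((next(names), combined))
    return tasks


# Hypothetical substitutions on the two input streams at processor P6.
stream_1 = [("T18", {"X": "a"}), ("T19", {"X": "b"})]
stream_2 = [("T21", {"Y": "a"}), ("T22", {"Y": "b"})]
names = (f"T{n}" for n in count(37))  # fresh names start at T37
new_tasks = cartesian_tasks(stream_1, stream_2, names)
```

With two tasks on each stream and no conflicting bindings, four new tasks are created, which is why four consecutive task numbers are consumed internally.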

2.3.6  Remarks  on  Efficiency 

This section contains comments on some efficiency issues related to the basic
execution model.

Distributed Environments: As described earlier, substitutions of tasks on virtual
input streams are retained in a process when subgoals are set up. This is
the  distributed  environment  approach.  An  alternative  would  be  to  send  complete 
copies  of  environments  to  child  tasks.  This  could  be  accomplished  by  replacing 
the  substitution  field,  as  it  stands  currently,  by  a  stack  of  substitutions.  However, 
the  disadvantage  with  the  “copying”  approach  is  that  communication  costs  will  be 
higher  and  perhaps  unacceptable.  The  disadvantage  of  distributed  environments  is 
that  subsolutions  must  be  returned  to  the  process  generating  subtasks  so  that  the 
Composition  function  may  be  applied  to  the  input  substitutions  paired  with  the 
subsolution  substitutions. 

Number of Tasks Generated: In the example shown in section 2.3.5, 66 tasks were
generated. If a sequential Prolog interpreter were used with the same database, the
number of logical inferences⁸ used would have been 15. One might ask if the 66/15
ratio of the number of tasks to the number of sequential logical inferences reflects
on the inefficiency of the model. As it turns out, the 66/15 ratio is completely
misleading. The number of tasks generated in a dataflow* graph is not a good
indicator of the cost of the model, as will be shown below.

⁸ A logical inference is defined to be a successful reduction of a literal goal by either one assertion
or one rule.

Several  simple  optimizations  can  be  used  to  do  away  with  a  large  number  of 
tasks  entirely  and  many  other  tasks  involve  trivial  amounts  of  computation.  In 
particular,  end-of-stream  tasks  need  not  be  sent  separately.  Each  end-of-stream 
task  can  be  piggybacked  on  the  last  regular  task  sent  on  the  stream  in  question. 
Another  optimization  is  to  replace  tasks  on  the  solution  channels  of  Tail  processes 
with  function  calls.  Since  each  such  function  call  involves  very  little  work  (i.e., 
Composition  of  two  substitutions  or  checking  whether  an  end-of-stream  task  should 
be  sent  on  the  output  streams  of  the  normal  process),  we  will  ignore  these  in  the  cost 
calculation. The Head nodes merely serve as routers of data, so tasks
on the task channels of Head nodes will be ignored in the cost calculation as well.
Finally, tasks on the input channels to Tail processes lead only to cartesian product
and not to any logical inferences, so we will ignore these tasks too. The cost
of cartesian product, in general, will be considered separately later in this section.
After  having  removed  all  tasks  from  the  example  that  are  to  be  ignored  as  described 
above,  we  notice  that  only  10  tasks  remain  for  which  logical  inferences  may  need  to 
be  performed.  These  tasks  are  T5,  T14,  T16,  T18,  T19,  T21,  T22,  T47,  T49,  and 
T50. 

However,  to  make  a  comparison  of  cost  between  PM  and  sequential  Prolog, 
even  this  number  of  tasks  remaining  can  be  misleading.  One  should  really  consider 
the  number  of  logical  inferences  that  are  associated  with  the  remaining  tasks.  On 
doing  the  arithmetic,  we  find  that,  in  this  particular  case,  the  number  of  logical 
inferences  in  the  example  is  also  10.  Notice  that  this  is  less  than  the  number  of 
logical  inferences  (15)  in  the  sequential  Prolog  case. 

The  number  of  logical  inferences  in  dataflow*  graphs  is  highly  dependent  on 
the  partial  order  that  is  chosen  for  conjunctive  goals.  By  choosing  a  bad  partial 
order,  it  is  possible  to  have  a  higher  number  of  logical  inferences  in  dataflow* 
graphs  compared  to  sequential  Prolog.  It  also  turns  out  that  if  no  and-parallelism  is 
exploited,  and  the  only  parallelism  exploited  is  or-parallelism  and  pipelining,  then 
the number of logical inferences is identical for both PM and sequential Prolog.
Also, as shown in the example, if the partial orders are chosen carefully, then the
number of logical inferences can be reduced.

In  addition  to  reducing  the  number  of  tasks,  one  can  also  reduce  the  number  of 
processes.  In  particular,  since  Head  processes  are  used  as  data  routers  only,  they 
do  not  have  to  be  created  explicitly.  Also,  both  Head  and  Tail  processes,  created 
when  an  assertion  is  used  to  reduce  a  literal  goal,  may  be  removed  because  there  is 
an  empty  DAG  between  them. 

Cost of Decomposition: Partial orders need to be generated for conjunctive goals.
As mentioned before, these partial orders are of the same type used by Conery’s
execution model [15]. Therefore, his algorithm for partial orders can be used directly
here. Also, appendix A describes another algorithm that is used in PM along with
the associated cost.

Trade-off between Space and Time: Non-shared memory architectures (like dataflow
architectures [70] and distributed systems [38]) have the property that extra space
may be consumed in the attempt to reduce time of execution. This can happen if,
for example, two parallel operations, O2 and O3, have a dataflow dependency on
the result of an operation, O1. If memory is not shared and all three operations are
on different processors, then copies of the result of O1 must be sent to the processors
associated with O2 and O3. In a shared memory architecture, the processors
associated with O2 and O3 could simply read a single copy of the result of O1 from
shared memory. The target architecture of PM does not have shared memory either
and, therefore, shares this property.

Cost of Cartesian Product of Streams: As was pointed out before, cartesian
product of streams requires extra memory compared to the sequential Prolog
execution. In particular, taking the cartesian product of streams requires space equal
to the sum of the lengths of the individual streams. In addition, the number of
elements in the cartesian product may be equal to the product of the lengths of
the streams in the worst case. Notice that the worst case is reached only when no
composite binding leads to any inconsistent bindings.
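
The two bounds above can be written compactly. The symbols k (number of input streams) and n_i (length of stream i) are introduced here only for illustration and do not appear elsewhere in the text.

```latex
\[
  \text{space} \;=\; \sum_{i=1}^{k} n_i ,
  \qquad
  \bigl|\text{cartesian product}\bigr| \;\le\; \prod_{i=1}^{k} n_i .
\]
% Worked instance: three streams of lengths 10, 20, and 30 require
% storage for 10 + 20 + 30 = 60 tasks, yet may yield up to
% 10 * 20 * 30 = 6000 composite tasks if no bindings conflict.
```

For the two streams of two tasks each in the example of section 2.3.5, this gives storage for four tasks and at most four composite tasks, which agrees with the four internally created tasks T37 through T40.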

In addition, the processing cost associated with the cartesian product function is
of the order of the product of the lengths of the streams. The situation is alleviated
somewhat by the fact that the constants involved in the processing cost are fairly
low. In particular, the most costly processing operation is checking to see whether
a composite substitution has consistent bindings. Even more importantly, however,
one can save on more costly logical inferences by using PM as illustrated in the
example.

In  conclusion,  there  is  a  cost  to  taking  the  cartesian  product  of  streams  and  this 
could  be  substantial  in  the  worst  case.  However,  the  additional  parallelism  gained 
may  outweigh  the  cost.  The  total  number  of  logical  inferences  may  be  reduced  as 
well  and  this  may  make  PM  an  attractive  option  for  sequential  processors  in  some 
cases.  The  example  given  earlier  in  section  2.3.5  illustrates  the  effect  of  reduction 
in  the  number  of  logical  inferences  compared  to  the  sequential  Prolog  execution. 

Resource Allocation: As mentioned before, the design of parallel execution models
is just one of many difficult problems that must be solved to make multiprocessing
a success. Resource allocation is one such problem. Notice, however, that this
is not a problem restricted to this particular parallel execution model.


2.4  Extensions  to  Basic  Model 

Three  extensions  to  the  basic  execution  model  are  described  in  this  section.  The 
first  two  deal  with  handling  storage  constraints  due  to  large  databases  and  long 
streams.  The  third  extension  deals  with  non-ground  bindings  of  variables. 


2.4.1  Handling  Storage  Constraints 

2.4.1.1  Large  Databases

As  mentioned  before,  the  basic  execution  model  deals  with  the  case  where  all  clauses 
that  can  be  used  to  reduce  any  particular  atomic  proposition  goal  reside  in  a  single 
processor.  One  can  achieve  this  if,  for  example,  one  partitions  the  database  on  the 
basis  of  the  predicate  symbol  of  the  head  of  the  clause.  Each  partition  is  mapped 
onto  a  single  processor. 
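
A minimal sketch of this partitioning is given below. The clauses are hypothetical (they are not the database of figure 14), clauses are represented as plain strings, and the helper names and the processor-naming scheme are assumptions made for illustration.

```python
from collections import defaultdict


def head_predicate(clause: str) -> str:
    """Predicate symbol of the clause head, e.g. 'r' for 'r(X,Y) :- p(X).'"""
    head = clause.split(":-")[0].strip()
    return head.split("(")[0]


def partition_by_predicate(clauses):
    """Group clauses so that all clauses able to reduce a given atomic
    goal land in the same partition."""
    parts = defaultdict(list)
    for clause in clauses:
        parts[head_predicate(clause)].append(clause)
    return dict(parts)


def assign_processors(partitions):
    """Map each partition onto a single processor (naming is arbitrary)."""
    return {pred: f"P{i}" for i, pred in enumerate(sorted(partitions), start=1)}


# Hypothetical database.
clauses = ["r(X,Y) :- p(X), q(Y), s(X,Y).", "p(a).", "p(b).", "q(a)."]
partitions = partition_by_predicate(clauses)
placement = assign_processors(partitions)
```

Here both `p` facts fall into one partition, so a goal on `p` can be reduced entirely at the processor that owns that partition.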




Of course, it is possible that a particular partition may not fit in a single processor
due to memory constraints. In addition, one may want to spread a partition over
many processors to exploit the parallelism in a single backward-chaining step. The
goal proposition may be unified in parallel with the heads of the relevant clauses
and subtasks may be created in parallel.

The  solution  is  to  maintain  a  single  processor  as  being  responsible  for  each 
partition  (as  before).  However,  instead  of  the  clauses  in  a  partition  physically 
residing  in  the  responsible  processor,  the  clauses  are  distributed  over  a  certain 
neighborhood  of  the  processor.  One  could,  for  example,  distribute  the  partition 
over  all  processors  within  some  number  of  message  hops  away  from  the  responsible 
processor. 

Two  extra  message  types  are  required  now  to  make  the  subtask  creation  and 
solution  collection  possible.  The  message  types  are  Do-Task  and  Done-Task. 

The Arguments field of the Do-Task message type is of the form:

<Task-Name, Task-Description, Source-Process-Name>

Task-Name is the name of the task that needs to be worked on. Task-Description
is its description. Source-Process-Name is the identity of the process that is sending
the message.

The Arguments field of the Done-Task message type is of the form:

<Task-Name, Child-Task-Name, Destination-Process-Name, Solution>

Task-Name is the name of the task that was originally received from the process
with name Destination-Process-Name. Child-Task-Name is the name of the child
task of Task-Name that was created, and Solution is the substitution that is being
reported as an answer to Child-Task-Name. Again, end-of-stream may be indicated
by an “EOS” in the Solution field.

The messages are used in the following way. Processes still reside at the processor
responsible for the relevant partition, that is, the partition that is
relevant to solving the atomic proposition associated with the process. When a
process receives an input-task message, it finds the incremental cartesian product
as before. The new tasks are, however, not solved locally. They are sent to the
neighborhood associated with the relevant partition around the processor using
Do-Task messages (i.e., each processor in the neighborhood receives a copy of the
Do-Task message). These processors in the neighborhood create subtasks just as
the single responsible processor would in the basic execution model. One difference is
that solutions must be communicated back to the responsible processor using
Done-Task messages. End-of-stream is indicated as before with the Solution argument set
to “EOS”. A second difference is that each processor in the neighborhood, including
those that cannot create any solutions or subtasks, must use a Done-Task message
to report to the responsible processor when all of its subtasks have been generated.
This is done by setting the Child-Task-Name argument of the message to nil and
the Solution argument to “EOS”. In the basic execution model, since all clauses
that could be used to create a subtask were in a single responsible processor, the
responsible processor knew locally when all possible subtasks had been created. The
new mechanism is necessary to replace knowledge that no longer resides locally.⁹

In addition, one needs to maintain a flag at the parent task to indicate whether
all possible subtasks have been found. This flag is false when a task is first created
using cartesian product at a process. After a Do-Task message has been sent out to the
appropriate neighborhood and the responsible processor has received an indication
from each processor that all subtasks have been generated, the flag can
be set to true.
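
The bookkeeping for this flag can be sketched as follows. The field names mirror the Arguments formats given above; everything else is an assumption made for illustration. In particular, the Done-Task format carries no sender field, so the sketch assumes the transport layer supplies the identity of the reporting processor, and the class name `ParentTask` is hypothetical.

```python
from dataclasses import dataclass, field
from typing import Optional

EOS = "EOS"


@dataclass(frozen=True)
class DoTask:
    task_name: str
    task_description: str
    source_process_name: str


@dataclass(frozen=True)
class DoneTask:
    task_name: str
    child_task_name: Optional[str]  # None plays the role of nil
    destination_process_name: str
    solution: object                # a substitution, or EOS


@dataclass
class ParentTask:
    name: str
    neighborhood: frozenset         # processors that received the Do-Task
    reported: set = field(default_factory=set)

    @property
    def all_subtasks_found(self) -> bool:
        """The flag: true once every processor has reported completion."""
        return self.reported == set(self.neighborhood)

    def on_done(self, sender: str, msg: DoneTask) -> None:
        # A completion report has a nil child name and an EOS solution.
        if (msg.task_name == self.name and msg.child_task_name is None
                and msg.solution == EOS):
            self.reported.add(sender)


parent = ParentTask("T50", frozenset({"P2", "P3"}))
flag_at_creation = parent.all_subtasks_found
parent.on_done("P2", DoneTask("T50", None, "tail-1", EOS))
flag_after_one = parent.all_subtasks_found
parent.on_done("P3", DoneTask("T50", None, "tail-1", EOS))
flag_after_all = parent.all_subtasks_found
```

The flag stays false until the last processor in the neighborhood has reported, including processors that generated no subtasks at all.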

Note that it is not necessary that the partition be distributed over some neighborhood
of a certain processor. The distribution may be over an arbitrary set of
processors. This extra flexibility may be useful for some task allocation strategies.

2.4.1.2  Long  Streams

Processes may have multiple input streams whose cartesian product has to be computed.
To create this cartesian product, essentially the process has to store complete
streams until the entire cartesian product has been obtained. Since it may be hard
to accurately predict the lengths of these streams ahead of time, it is possible that
the processor responsible for the process may not have the requisite storage.

⁹ A more efficient solution to propagating Do-Task and Done-Task messages to/from the neighborhood
is possible but the idea here is merely to show that a satisfactory solution to the problem
exists.

Figure 18: Handling Long Streams
Rule: h(X,Y,Z,U,V) :- t1(X), t2(Y), t3(X,Y,Z), t4(Z,U), t5(Z,V)
Goal: t1(X), t2(Y), t3(X,Y,Z), t4(Z,U), t5(Z,V)
Panels: Original Dataflow* Graph; Additional Dataflow* Graph

The solution is to sequentialize the dataflow* graph upstream from the process
up to the Head process. As much sequentialization is done as is necessary to remove
the memory problem. In the worst case, the sequentialization may lead to a linear
sequence of processes requiring absolutely no cartesian products of streams. Of
course, this means that no and-parallelism is exploited. Or-parallelism and pipelining
will continue to be exploited as before. Figure 18 shows an example of this
process. In the example, the node corresponding to t3(X,Y,Z) is the one that gets
into a memory constraint situation.

Notice that the new dataflow* graph must exist independently along with the
old dataflow* graph. This is necessary because tasks/solutions may still be in
the pipeline in the original dataflow* graph when the new graph is introduced.
Therefore, more than one Tail process may exist for a certain task. When solutions
flow out of the Tail processes, an indeterminate merge of these streams must happen.
Also, an EOS is sent from the collection of Tail processes only when they have all
produced an EOS.
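
The end-of-stream rule for a collection of Tail processes can be sketched as a counter. The class `TailMerge` is hypothetical and stands in for whatever component performs the indeterminate merge; arrival order is simply the order in which `receive` is called.

```python
EOS = "EOS"


class TailMerge:
    """Merge the solution streams of several Tail processes for one task.
    Solutions are forwarded in arrival order; a single EOS is emitted
    only after every Tail process has produced its own EOS."""

    def __init__(self, tail_count: int):
        self.pending = tail_count  # Tail processes that have not yet ended
        self.output = []           # the merged outgoing stream

    def receive(self, item) -> None:
        if item == EOS:
            self.pending -= 1
            if self.pending == 0:
                self.output.append(EOS)
        else:
            self.output.append(item)


# Two Tail processes: one from the original graph, one from the new graph.
merged = TailMerge(tail_count=2)
for item in [{"Z": "a"}, EOS, {"Z": "b"}, EOS]:
    merged.receive(item)
```

The first EOS is absorbed because one Tail process is still active; only the second EOS closes the merged stream.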


2.4.2  Handling  Non-Ground  Bindings 

If  a  process  produces  non-ground  bindings  for  the  atomic  proposition  associated 
with  it,  then  some  downstream  processes  that  work  in  parallel  may  not  be  able  to 
do  so  any  more.  Processes  should  execute  in  parallel  only  if  the  bindings  they  are 
expected  to  produce  are  not  for  any  common  variables.  A  non-ground  binding  from 
a  preceding  process  may  remove  this  necessary  condition. 

The solution is more or less complementary to the solution for the long stream
problem. The dataflow* graph downstream from the process in question is
sequentialized as much as necessary in order to avoid the problem. Figure 19 shows an
example of this process. In the example, the node t3(X,Y,Z) is expected to produce
a ground binding for the variable Z but does not. As in the long stream case,
the multiple dataflow* graphs coexist independently. Multiple Tail processes are
handled as before.


2.4.3  Handling  Multiple  Copies 

As of now, only one copy is allowed for each partition of the database. If there are
goals generated in parallel that use the same partition, this restriction may lead to
a bottleneck. A way out of this problem is to allow multiple copies of partitions.
The solution is to decouple the functions of a normal process into two process types:
CP and normal-new. The function of the CP process type is to compute cartesian
products only. The normal-new process type does the rest of the computation that
a normal process type did. Figure 20 shows graphically the interaction between
the different process types. As indicated in the figure, two new message types are
required. These are Distribute-Task and Collect-Task. The Distribute-Task
message type is used to distribute computation to multiple copies of the partition,
and the Collect-Task message type is used to collect solutions from the multiple
copies of the partition.

Figure 19: Handling Non-Ground Bindings
Rule: h(X,Y,Z,U,V) :- t1(X), t2(Y), t3(X,Y,Z), t4(Z,U), t5(Z,V)
Goal: t1(X), t2(Y), t3(X,Y,Z), t4(Z,U), t5(Z,V)
Panels: Original Dataflow* Graph; Additional Dataflow* Graph

Figure 20: Handling Multiple Copies (label: Cluster of Copies)

The Arguments field of the Distribute-Task message type is of the form:

<Destination-Process-Name, Task-Name, Task-Description>

The Arguments field of the Collect-Task message type is of the form:

<Destination-Process-Name, Spawned-Task-Name, Solution>

Solution is the substitution that is being reported as a solution to the task whose
name is Spawned-Task-Name. As before, end-of-stream is indicated by an “EOS”
in the Solution field.
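
One way a CP process might spread tasks over the copies is sketched below. The message fields follow the two Arguments formats above. The text does not say how a copy is chosen for each task, so the round-robin policy, the function name `distribute`, and the process names are all assumptions.

```python
from dataclasses import dataclass
from itertools import cycle


@dataclass(frozen=True)
class DistributeTask:
    destination_process_name: str
    task_name: str
    task_description: str


@dataclass(frozen=True)
class CollectTask:
    destination_process_name: str
    spawned_task_name: str
    solution: object  # a substitution, or "EOS"


def distribute(tasks, copies):
    """Build one Distribute-Task message per task, choosing the target
    normal-new process round-robin (an assumed policy)."""
    targets = cycle(copies)
    return [DistributeTask(next(targets), name, description)
            for name, description in tasks]


# Hypothetical tasks produced by a CP process and two partition copies.
tasks = [("T37", "s(a,a)"), ("T38", "s(a,b)"), ("T39", "s(b,a)")]
messages = distribute(tasks, copies=["normal-new-1", "normal-new-2"])
reply = CollectTask("cp-1", "T37", {"X": "a", "Y": "a"})
```

Any other selection policy, such as sending each task to the least loaded copy, would fit the same two message types.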

Just  as  in  the  case  of  handling  large  databases,  the  multiple  copies  may  be 
distributed  to  some  neighborhood  of  a  central  processor  or  they  may  be  in  some 
arbitrary set of processors.


2.5  Discussion

It  was  mentioned  earlier  that  side-effects  are  not  allowed  in  PM.  This  is  not  strictly 
true  because  benign  side-effects  that  do  not  affect  the  result  of  a  computation  but 
only  affect  the  efficiency  can  be  allowed.  A  side-effect  of  this  type  is  caching  of 
results.  In  general,  the  lack  of  general  side-effects  is  not  as  severe  a  problem  as 
it  might  seem.  Many  search  procedures  [48]  do  not  need  any  side-effects.  Specific 
applications  that  do  not  need  side-effects  include  diagnosis  [28]  and  test-generation 
[59] — both  for  digital  hardware. 

Also, it was mentioned that PM is designed only for non-shared memory architectures.
However, it is not hard to modify PM to work on shared memory
architectures as well. Going the reverse route (i.e., taking a shared memory
algorithm and making it work on a non-shared memory architecture) is typically
harder.

The  rest  of  this  section  compares  PM  to  related  parallel  execution  models.  The 
related  work  that  is  discussed  in  this  section  is  work  by  Conery  [15],  Singh  and 
Genesereth  [61],  Lindstrom  and  Panangaden  [41],  Ciepielewski  and  Haridi  [12],  Bic 
[8],  Clark  and  Gregory  [14],  Shapiro  [57],  Borgwardt  [9],  and  Furukawa  [25]. 

The  research  presented  in  this  chapter  builds  on  two  important  ideas.  One  is  the 
exploitation  of  and-parallelism  as  described  by  Conery  in  his  dissertation  [15].  The 
other  is  the  exploitation  of  or-parallelism  and  pipelining  as  described  by  Ciepielewski 
and  Haridi  [12],  Lindstrom  and  Panangaden  [41],  and  Singh  and  Genesereth  [61]. 
The  connections  of  PM  with  these  two  sets  of  ideas  are  described  below. 

Conery’s execution model exploited a restricted sort of and-parallelism. This
restriction is exactly the one used in PM. A significant difference is that the
backtracking control of Conery is completely abandoned. Instead, PM uses a dataflow
solution (with the exceptions described before). One consequence is that
communication is reduced because all communication associated with backtracking is
absent. A second consequence is that control is more decentralized. In general,
Conery’s and-processes correspond to the Head/Tail process pairs used in PM and
Conery’s or-processes correspond to the normal processes in PM. However, PM does not have
the Head/Tail process pairs coordinate the activities of the normal processes (in
between) as Conery’s model had the and-processes do for the children or-processes.
A third consequence is that parallelism due to pipelining comes for free in PM.
Conery’s execution model, on the other hand, sends one solution at a time along
“dataflow” arcs. Further solutions are sent on the prodding of another level of
control analogous to backtracking in sequential Prolog.

Haridi and Ciepielewski [12], Lindstrom and Panangaden [41], and Singh and
Genesereth [61] showed how or-parallelism and pipelining could be exploited together.
In these pieces of research, conjunctive goals were solved from left to right
in sequence. PM exploits and-parallelism also by using the idea of streaming for
pipelining but allows the total order of conjuncts to be changed to a partial order.
Or-parallelism is exploited as before. However, the cost of exploiting the additional
parallelism is that a dataflow solution (modulo indeterminate merge) has a
non-dataflow feature, cartesian product of streams, added to it. Although cartesian
product does require state to be maintained, the good news is that it is only local
state. No global state is maintained.

Bic [8] describes another data-driven parallel execution model. However, this
model only handles a restricted form of Horn clauses. Specifically, predicates must
be binary, functions must be immediately evaluable during execution, and no
structured terms are allowed.

Other parallel execution models have made use of programmer-supplied annotations
to control the parallelism. Examples include Clark and Gregory’s PARLOG
[14], Shapiro’s Concurrent Prolog [57], and Borgwardt’s execution model [9]. PM
differs from these execution models in that it does not use any annotations.
Another difference is that none of these three execution models exploits pipelining as
exploited by PM. Moreover, Borgwardt’s execution model is restricted to
shared-memory architectures. However, these execution models have been characterized by
the exploitation of another form of parallelism: stream parallelism. As defined by
Conery [15], this type of parallelism involves the pipelining of structured data. For
example, if two functions are to be applied, in sequence, to a list of data elements,
the first function may be applied to the elements of the list one by one and these
partial results may be sent to the second function as they are generated. The second
function may be applied to the result elements as they are generated by the first
function. Typically, this form of parallelism is not important for knowledge-based
applications. For example, the diagnosis [28] and test-generation [59] applications
mentioned before do not contain any exploitable parallelism of this type.

Matsumoto et al., in their backup parallelism model [25], view each node of
the  and-or  tree  as  a  process.  Each  and  (or)  process  activates  descendant  or  (and) 
processes.  A  descendant  process  starts  searching  for  another  solution  right  after 
it  sends  a  solution  to  the  parent  process.  If  an  additional  solution  is  found  and 
it  is  not  needed  by  the  parent,  the  descendant  process  suspends.  If  the  process  is 
reactivated  by  the  parent  process  in  the  future,  it  immediately  returns  a  previously 
found  solution  or  continues  trying  to  find  a  solution.  Therefore,  one  level  of  or- 
parallelism  is  maintained  throughout  the  tree  of  processes.  PM  does  not  restrict 
the  level  of  or-parallelism. 

After  the  work  on  PM  was  originally  published  [63,62],  Li  [40]  came  up  with 
essentially  the  same  idea  independently  for  her  doctoral  dissertation.  She  calls  her 
parallel  execution  model  the  Sync  Model. 

The  list  of  parallel  execution  models  compared  to  PM  in  this  section  is  by  no 
means  exhaustive.  An  attempt  was  made,  however,  to  cover  all  major  categories 
that  are  relevant. 


2.6  Conclusions 

This chapter has described PM, a parallel execution model for backward-chaining
deductions. The most important contribution of this chapter is that PM can
simultaneously exploit or-parallelism, and-parallelism, and pipelining. This is more
parallelism than is exploited by other execution models using dataflow principles,
multiprocessors with no shared memory, and distributed databases. The extra
parallelism can be an important advantage in a situation where large numbers of
processors are available. Using dataflow principles means that synchronization
overhead is minimized and the inherent parallelism can be fully exploited.


Chapter  3 
Cost  Function 


Optimal task allocation, even for relatively simple problems, is NP-complete [43].
The approach taken in this thesis is to define a cost function that quantifies
intuitive notions of undesirable allocations and yet allows for efficient computation
and recomputation. This chapter describes the cost function formally and presents
algorithms for its computation and recomputation. The next chapter describes
the allocation algorithms that use this cost function and results obtained from an
implementation of the allocator.

This  chapter  is  organized  as  follows.  Section  3.1  gives  a  formal  definition  of 
the  cost  function.  The  next  two  sections  describe  algorithms  to  compute  this  cost 
function. 


3.1  Definition  of  Cost  Function 

3.1.1  Preliminary  Definitions 

The  logic  program  is  described  as  a  3-tuple  <  F,R,G  >,  where  F  is  the  set  of  facts 
(i.e.,  Horn  clauses  with  exactly  one  positive  literal  and  no  negative  literals),  R  is 
the  set  of  rules  (i.e.,  Horn  clauses  with  exactly  one  positive  literal  and  one  or  more 
negative  literals),  and  G  is  the  set  of  goals  (i.e.,  conjunctions  of  positive  literals). 

Both facts and goals at compile-time may contain unknown constants. Unknown
constants exist at compile-time only and represent constants at run-time. They are
called unknown because their exact values at run-time are not known at compile-time.
Since facts/goals with unknown constants may represent one of potentially
many actual facts/goals with constants, facts/goals with unknown constants may
be called fact patterns or goal patterns. For example, a fact pattern p(uc), where uc
is an unknown constant, may represent either of the facts p(c1) or p(c2), where c1 and
c2 are actual constants.

Fact  and  goal  patterns  may  be  specified  with  an  associated  number.  The  number 
represents  the  expected  number  of  instances  of  those  fact  and  goal  patterns  at 
run-time.  An  instance  of  a  fact/goal  pattern  is  a  fact/goal  with  specific  values 
substituted  for  all  unknown  constants  in  the  fact/goal  pattern. 

$L|_S$, where L is a literal and S is a substitution, denotes the literal obtained by
applying the substitution S to the literal L.

A  cluster  of  processors  is  defined  to  be  a  set  of  processors  that  includes  a  central 
processor  for  the  cluster  and  all  processors  within  some  specified  distance  away  from 
the  central  processor.  The  size  of  the  cluster  can  vary  depending  on  the  maximum 
distance  allowed  from  the  central  processor  and  processors  on  the  periphery  of  the 
cluster. Given the FAIM-1 topology as described in 1, these cluster sizes can be
3E(E - 1) + 1 for positive integer E. The maximum number of processors in a cluster
is  restricted  to  be  less  than  or  equal  to  the  maximum  number  of  processors  in  the 
multiprocessor. 
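
As a small illustration of the cluster-size formula above, the following sketch lists the cluster sizes that fit in a multiprocessor of a given size. The function names and the processor count used in the example are assumptions made for illustration only; they do not come from the FAIM-1 design.

```python
def cluster_size(e: int) -> int:
    """Cluster size 3E(E - 1) + 1 for a positive integer E."""
    if e < 1:
        raise ValueError("E must be a positive integer")
    return 3 * e * (e - 1) + 1


def allowed_cluster_sizes(max_processors: int) -> list[int]:
    """All cluster sizes that do not exceed the number of processors."""
    sizes = []
    e = 1
    while cluster_size(e) <= max_processors:
        sizes.append(cluster_size(e))
        e += 1
    return sizes


# Hypothetical multiprocessor with 20 processors.
print(allowed_cluster_sizes(20))  # [1, 7, 19]
```

A cluster with E = 1 contains only the central processor, which corresponds to a partition with a single copy.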


3.1.2  Assumptions 

The  algorithms  to  compute  the  cost  function  depend  on  the  assumptions  listed 
below: 


1.  Unknown  constants  must  represent  atomic  constants  at  run-time.  They  may 
not  represent  compound  terms. 

This  assumption  is  chosen  so  that  it  will  be  possible  to  estimate  the  amount
of  data  (in  bytes,  say)  that  will  be  used  to  represent  these  constants  at  run¬ 
time.  If  they  represent  arbitrary  structures  or  functionals,  then  it  may  not  be 
possible  to  estimate  the  amount  of  data  without  additional  information  from 
the  user. 

One  way  to  satisfy  this  assumption  is  to  not  allow  any  structures  or  function¬ 
als  at  all.  However,  one  does  not  have  to  be  this  strict  because  all  that  is 
required  is  that  unknown  constants  not  be  bound  to  structures  or  functionals. 
Chapter  4  (on  Allocation  Algorithms)  contains  an  example  that  has  to  do 
with  reasoning  about  a  digital  hardware  device.  In  that  example,  functionals 
are  present,  yet  unknown  constants  are  always  bound  to  atomic  constants. 

2. There are no recursive clauses in F ∪ R.

With  arbitrary  recursive  clauses,  it  is  not  possible  to  estimate  the  amount 
of  communication  or  computation  because  the  problem  is  equivalent  to  the 
halting  problem.  (It  will  be  seen  in  section  3.1.3  that  estimating  the  amount 
of  communication  and  computation  is  necessary  to  compute  the  cost  func¬ 
tion.)  However,  in  certain  recursive  cases,  it  may  be  possible  to  estimate  the 
amount  of  communication  and  computation  automatically.  For  example,  if 
the  length  of  a  list  argument  gets  reduced  by  one  for  every  recursive  call,  then 
the  recursion  depth  can  be  estimated  to  be  the  length  of  the  list  and  it  should 
be  possible  to  estimate  communication  and  computation.  In  addition,  there 
may  be  other  cases  where  some  pragmas  (or  hints)  from  the  user  may  allow 
a  program  to  complete  the  rest  of  the  analysis.  For  example,  in  a  quick  sort 
program,  the  length  of  the  list  gets  reduced  to  half  for  every  recursive  call. 
Therefore, the recursion depth is $\log_2 n$, where n is the length of the list.

3.  Each  fact  in  F  is  ground  (i.e.,  no  variables  are  allowed  in  any  fact).  Rules  may 
contain  variables,  however. 

Again, this assumption is designed so that proper estimates may be made of
the amount of communication and computation. In particular, this assumption
makes it possible to know which DAG will be used for a particular set of
conjuncts in a conjunctive goal. Recall that section 2.4.2 described
how to handle non-ground bindings. This complication can be ignored when
this assumption is made.

4.  Equal  frequency  assumption:  An  unknown  constant  is  equally  likely  to  repre¬ 
sent  any  known  constant  in  the  associated  domain. 

For  lack  of  any  more  information,  this  assumption  seems  as  good  as  any.  The 
question  arises,  however,  of  what  to  do  if  more  precise  information  is  given 
about the probability distribution of the unknown constant values. This thesis
does not make a contribution here. It should be noted, though, that this
assumption will be used later to compute the probability of two literals unifying.
That computation is completely independent of other parts of this thesis.
Therefore,  if  techniques  are  found  for  taking  other  probability  distributions 
into  consideration,  then  they  can  be  used  immediately  with  no  change  to  the 
rest  of  the  thesis. 

5.  Variable  independence  assumption:  During  unification  of  two  literals,  we  as¬ 
sume  that  each  distinct  variable  varies  independently  over  its  domain. 

Again, this assumption is made to allow computation of the probability of
unification of two literals. And again, this issue is orthogonal to the rest of
the thesis. Therefore, other techniques for estimating probability may be used
freely.

6.  Literal  independence  assumption:  Solutions  of  individual  conjuncts  in  a  con¬ 
junctive  goal  are  independent  of  each  other. 

The  same  comments  that  applied  to  the  two  previous  assumptions  apply  here 
as  well. 

7.  Multiple  copy  clustering  assumption:  Although  multiple  copies  of  a  single 
partition  may  be  distributed  over  the  set  of  processors  in  an  arbitrary  way,  we 
consider  the  restricted  case  in  which  all  processors  in  a  cluster  of  processors 
contain  a  copy  of  the  partition  (and  no  other  processors  contain  a  copy). 

This restricted distribution of multiple copies is probably not the best one
for all applications. However, for the applications considered in this
thesis,  this  assumption  is  reasonable.  It  will  be  argued  in  chapter  4  while 
discussing  experimental  results  that  this  assumption  is  reasonable  for  a  fairly 
wide  class  of  applications — the  class  of  applications  in  which  there  is  a  high 
degree  of  locality  of  computation.  Reasoning  about  digital  hardware  seems 
to  exhibit  this  locality.  Reasoning  about  other  physical  artifacts  may  exhibit 
this  locality  as  well. 

The  alternative  of  allowing  arbitrary  locations  of  copies  may  not  be  unfeasible 
but  leads  to  more  expensive  cost  computation/recomputation  and  allocation 
search  algorithms.  Therefore,  if  it  is  not  necessary,  as  in  the  applications 
considered  in  this  thesis,  then  it  is  best  to  use  the  “clustered  copies”  approach. 

8.  Multiple  copy  uniformity  assumption:  Again,  in  the  general  case,  multiple 
goals  associated  with  the  same  partition  may  be  distributed  in  an  arbitrary 
way  over  multiple  copies  of  the  partition.  We  consider  the  restricted  case  in 
which goals are uniformly distributed over the multiple copies. In particular,
the  uniform  distribution  is  done  by  assigning  any  new  goal  to  a  random  copy 
of  the  associated  partition. 

The  same  comment  that  applied  to  the  previous  assumption  applies  here  as 
well. 


3.1.3  Cost  Function 

C,  the  cost  function,  takes  an  allocation  as  defined  in  chapter  1  and  returns  a  non¬ 
negative  real.  Actually,  since  multiple  copies  are  restricted  to  clusters  as  described 
above,  an  allocation  can  now  be  restated  to  be  a  many-to-one  (instead  of  many-to- 
many)  mapping  of  partitions  to  processors.  The  processor  mapping  of  a  partition  is 
taken  by  convention  to  be  the  central  processor  in  the  associated  cluster  of  copies. 

Figure 21: Parallelism Profile of a Computation

I will now give some motivation for the cost function before defining it formally.
Every parallel computation has an associated parallelism profile, where the parallelism
profile is defined to be the function that gives the number of busy processors versus
time, assuming unbounded processors and memory, and instantaneous communication.
Let us say that the profile is as given in figure 21. Now, a lower bound on the
completion time for the computation for any practical multiprocessor is given by
L (because any practical multiprocessor will have a bounded number of processors
and non-zero communication delays). If A is an allocation, a cost function C' can
be defined as follows:

$$C'(A) = L + CC(A) + PMC(A)$$

where  CC(A)  (or  the  communication  cost  of  the  allocation)  is  the  additional  delay 
expected  due  to  non-zero  communication  delays  in  a  practical  multiprocessor  and 
PMC(A)  (or  processor  multiplexing  cost  of  the  allocation)  is  the  additional  delay 
expected  due  to  sequentialization  of  parallel  computations.  Notice  that  L  is  inde¬ 
pendent  of  any  allocation.  Therefore,  if  the  only  purpose  of  using  the  cost  function 
is  to  compare  multiple  allocations,  a  new  cost  function  C  can  be  defined  as  follows: 


C(A)  =  CC(A)  +  PMC(A) 

In general, there is a trade-off between CC and PMC. Allocating all computation
to a single processor makes CC zero. However, PMC is the highest for
this situation since all parallel computation needs to be sequentialized. On the
other hand, if the computation is spread out among as many processors as possible
(assuming for now that there is no shortage of processors), then PMC is lowest.
However, CC is the highest for this situation. Finding a good allocation depends
on finding a good trade-off between CC and PMC.

Notice  that  no  parallelism  is  exploited  within  any  given  partition.  Therefore  if 
the  dataflow*  graph  is  as  given  in  figure  22,  where  the  dashed  lines  enclose  compu¬ 
tation  within  partitions,  then  only  communication  and  parallelism  across  partition 
boundaries  make  contributions  to  CC  and  PMC  respectively. 


3.1.4  Communication  Cost  Function 

CC, the communication cost function, is defined to be the sum of delays of all the
messages  that  need  to  be  sent.  This  is  an  upper  bound  on  the  extra  delay  that 
should  be  expected  due  to  non-zero  communication  delay.  The  upper  bound  will 
be  reached  if  all  the  communication  is  on  the  critical  path.  A  closer  upper  bound 
might  take  parallelism  of  communication  into  account  and  this  is  explored  a  bit 
in  chapter  4.  It  turns  out  that  the  current  definition  of  the  communication  cost 
function  works  quite  well  (as  will  be  seen  in  chapter  4). 

Let  delay(dt,ds)  be  the  time  taken  for  a  message  with  data  size  dt  to  travel 
from  a  source  to  a  destination  separated  by  distance  ds.  The  units  for  dt  and  ds 
could  be  bytes  and  hops  respectively,  for  example.  In  the  FAIM-1  multiprocessor, 
extensive  simulation  has  shown  [67]  that  the  delay  function  is  expected  to  be  of  the 
form  given  below. 


$$\mathit{delay}(dt, ds) = \begin{cases} K_1 + K_2 \times dt + K_3 \times ds & \text{if } ds > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (1)$$

where $K_1$, $K_2$, and $K_3$ are constants. Note that ds = 0 means that the source
and destination processors are the same.
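
The delay function of equation 1 can be sketched directly in code. The constant values below are hypothetical placeholders chosen only so the arithmetic is easy to follow; they are not the values obtained from the FAIM-1 simulations.

```python
def delay(dt: float, ds: int, k1: float, k2: float, k3: float) -> float:
    """Delay of one message of data size dt sent over distance ds (equation 1)."""
    if ds > 0:
        return k1 + k2 * dt + k3 * ds
    # ds == 0: source and destination are the same processor.
    return 0.0


# Hypothetical constants, for illustration only.
K1, K2, K3 = 5.0, 0.5, 2.0

print(delay(100, 2, K1, K2, K3))  # 5 + 0.5*100 + 2*2 = 59.0
print(delay(100, 0, K1, K2, K3))  # 0.0
```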


Figure  22:  Partitioned  Dataflow*  Graph 

Formally, we can say

$$CC(A) = \sum_{\forall M_j \in SM} \mathit{delay}(\mathit{data}(M_j), \mathit{distance}(M_j)) \qquad (2)$$

where  SM  is  the  set  of  messages  that  need  to  be  sent,  and  data  and  distance 
are  functions  that  give  the  data  size  and  distance  (between  source  and  destination 
processors)  of  a  message. 

As  will  be  seen  later  in  the  description  of  the  algorithm  to  compute  communi¬ 
cation  cost,  it  is  useful  to  reformulate  equation  2  using  equation  1  as  shown  below. 

$$CC(A) = \sum_{\forall i} \sum_{\forall j} SD_{ij} \qquad (3)$$

where  SDij  is  the  sum  of  delays  for  all  messages  that  need  to  be  sent  from  partition 
i  to  partition  j.  Now,  if  these  partitions  are  mapped  to  the  same  processor,  we  have 

SDij  =  0  (4) 

Let us consider the other case in which the two partitions are not mapped to the
same processor. Further, let the distance between the two processors be dist(i,j).
Now,

$$SD_{ij} = \sum_{\forall M_l \in SMP_{ij}} \mathit{delay}(\mathit{data}(M_l), \mathit{dist}(i,j))$$

where $SMP_{ij}$ is the set of messages that need to be sent from partition i to partition
j. Substituting from equation 1, we get

$$SD_{ij} = \sum_{\forall M_l \in SMP_{ij}} \left( K_1 + K_2 \times \mathit{data}(M_l) + K_3 \times \mathit{dist}(i,j) \right)$$

Now, let the number of messages sent from partition i to partition j be num(i,j)
and the total amount of data in all these messages be data(i,j). Substituting this
into the above equation gives

$$SD_{ij} = K_1 \times \mathit{num}(i,j) + K_2 \times \mathit{data}(i,j) + K_3 \times \mathit{num}(i,j) \times \mathit{dist}(i,j)$$


In  summary,  we  have  the  following  equations  for  communication  cost: 
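
$$CC(A) = \sum_{\forall i} \sum_{\forall j} SD_{ij} \qquad (5)$$

$$SD_{ij} = \begin{cases} K_1 \times \mathit{num}(i,j) + K_2 \times \mathit{data}(i,j) + K_3 \times \mathit{num}(i,j) \times \mathit{dist}(i,j) & \text{if } \mathit{dist}(i,j) > 0 \\ 0 & \text{otherwise} \end{cases} \qquad (6)$$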

Therefore,  to  compute  the  communication  cost  function,  it  is  sufficient  to  know 
the  total  number  of  messages  and  the  total  amount  of  data  to  be  sent  between  each 
ordered  pair  of  partitions.  Notice  on  the  right  hand  side  of  equation  6  that  only 
dist(i,j)  is  dependent  on  the  particular  allocation  being  considered.  Therefore,  if 
a  different  allocation  is  considered,  very  little  recomputation  needs  to  be  done  to 
compute $SD_{ij}$ and in turn CC.
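
The separation between allocation-independent traffic and the allocation-dependent distance can be sketched as follows. This is only an illustration under stated assumptions: the partition names, the traffic figures, the constants, and the one-dimensional distance function are all hypothetical (the FAIM-1 topology is not a line of processors).

```python
from typing import Callable, Dict, Tuple

# Allocation-independent input: (i, j) -> (num(i,j), data(i,j)).
Traffic = Dict[Tuple[str, str], Tuple[float, float]]


def sum_of_delays(num: float, data: float, dist: int,
                  k1: float, k2: float, k3: float) -> float:
    """SD_ij from equation 6 for one ordered pair of partitions."""
    if dist <= 0:
        return 0.0
    return k1 * num + k2 * data + k3 * num * dist


def communication_cost(traffic: Traffic,
                       allocation: Dict[str, int],
                       distance: Callable[[int, int], int],
                       k1: float, k2: float, k3: float) -> float:
    """CC(A) from equation 5: sum SD_ij over all ordered pairs with traffic."""
    total = 0.0
    for (i, j), (num, data) in traffic.items():
        d = distance(allocation[i], allocation[j])
        total += sum_of_delays(num, data, d, k1, k2, k3)
    return total


# Hypothetical traffic estimate and constants.
traffic = {("a", "b"): (3, 120.0), ("b", "a"): (1, 40.0)}
line_distance = lambda p, q: abs(p - q)  # assumed toy topology

spread = communication_cost(traffic, {"a": 0, "b": 2}, line_distance, 5.0, 0.5, 2.0)
together = communication_cost(traffic, {"a": 0, "b": 0}, line_distance, 5.0, 0.5, 2.0)
print(spread, together)  # 116.0 0.0
```

Only the `allocation` argument changes between the two calls; the traffic table is reused unchanged, which is the source of the cheap recomputation noted above.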

In  case  multiple  copies  of  partitions  are  allowed,  there  will  be  some  additional 
communication  between  the  CP  processes  and  the  associated  normal-new  processes 
(see  section  2.4.3).  Also,  if  the  communication  is  non-zero,  the  communication  cost 
given  above  in  equations  5  and  6  varies  linearly  with  the  distance.  Therefore,  the 
additional  communication  cost  can  be  accounted  for  very  easily  by  associating  it 
with  a  distance  that  is  the  expected  distance  from  the  central  processor  of  the  par¬ 
tition  to  all  other  processors  that  contain  copies  of  the  partition.  This  is  reasonable 
because  the  multiple  copy  uniformity  assumption  dictates  that  multiple  copies  of 
partitions are used equally (in a probabilistic sense).

3.1.5  Processor  Multiplexing  Cost  Function 

Informally,  the  processor  multiplexing  cost  function  PMC  ignores  all  communication 
cost  (i.e.,  assumes  instantaneous  communication)  and  increments  cost  for  every 
instance in which two tasks could be done in parallel but are assigned to the same
processor. 

PMC  is  defined  with  respect  to  a  hypothetical  world  and  not  the  real  world. 
This  hypothetical  world  can  be  defined  in  terms  of  differences  from  the  real  world. 
There  are  two  differences: 

1.  Zero  communication  delays

Messages  get  transmitted  instantaneously  in  the  hypothetical  world. 

2.  Infinite  pool  of  virtual  processors  for  each  actual  processor 

When  an  actual  processor  receives  a  message,  it  immediately  assigns  (with 
no  overhead)  a  free  virtual  processor  from  its  pool  to  process  the  message. 
However,  the  processing  of  each  message  by  a  virtual  processor  is  done  in  the 
normal  sequential  manner. 

Given this hypothetical world, it is clearly possible to have more than one task
being executed at a particular actual processor at any time. Let us define the
processor-load $pl_{i,j}(t)$ of an actual processor $P_j$ at time t for top-level goal $G_i$ to be
the number of tasks generated from $G_i$ being executed at $P_j$ at time t. A particular
$pl_{i,j}(t)$ may look like the curve in figure 23. Also, excess-processor-load is defined to
be the excess over 1 of the processor load. In other words,

$$epl_{i,j}(t) = \max(0, pl_{i,j}(t) - 1)$$

In figure 23, $epl_{i,j}(t)$ is the value of $pl_{i,j}(t)$ over the y = 1 dashed line. Since there
is only one unit of processing power available at each processor, $epl_{i,j}(t)$ represents
computation that must be sequentialized. An upper bound on the additional time
taken due to this sequentialization is represented by the shaded area above the y = 1
line. The upper bound is reached if all the computation that must be sequentialized
is on the critical path of the computation. The sum of these shaded areas for all
processors weighted by the top-level goals is defined to be the processor multiplexing
cost. To be more precise,

$$PMC(A) = \sum_{\forall G_i \in SG} \mathit{numgoal}(G_i) \times \sum_{j=1}^{q} \int_{0}^{\infty} epl_{i,j}(t)\,dt \qquad (7)$$

where SG is the set of top-level goals, numgoal($G_i$) is the number associated with
top-level goal $G_i$, there are q processors named $P_1 \ldots P_q$, and $epl_{i,j}(t)$ is the excess-
processor-load of actual processor $P_j$ considering only top-level goal $G_i$.
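
Equation 7 can be illustrated with a short sketch, assuming that each processor-load function is piecewise constant and is given as a list of (start, end, load) steps. That representation, the function names, and the sample data are assumptions made for this example, not the representation used by the allocator.

```python
from typing import Dict, List, Tuple

Steps = List[Tuple[float, float, float]]  # (start, end, load) with constant load


def excess_load_area(steps: Steps) -> float:
    """Integral of epl(t) = max(0, pl(t) - 1) for a piecewise-constant pl."""
    return sum((end - start) * max(0, load - 1) for start, end, load in steps)


def processor_multiplexing_cost(goals: Dict[str, Tuple[int, Dict[str, Steps]]]) -> float:
    """PMC as in equation 7: goal -> (numgoal, {processor: load steps})."""
    total = 0.0
    for numgoal, loads in goals.values():
        total += numgoal * sum(excess_load_area(s) for s in loads.values())
    return total


# Hypothetical loads for one top-level goal expected to occur twice.
goals = {
    "g1": (2, {
        "P1": [(0, 2, 1), (2, 5, 3), (5, 6, 2)],  # excess area 3*2 + 1*1 = 7
        "P2": [(0, 4, 1)],                        # never exceeds 1: area 0
    }),
}
print(processor_multiplexing_cost(goals))  # 14.0
```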

Figure  23:  Processor  Load  Function

One  way  to  compute  the  processor  multiplexing  cost  is  to  first  compute  what 
is  necessary  for  any  allocation.  Then,  additional  computation  can  be  done  to  take 
a  specific  allocation  into  account.  In  particular,  processor-load  can  be  computed 
for  each  partition  assuming  it  is  allocated  to  a  processor  separate  from  any  other 
partition. The following allocation-specific computation must be done for each top-
level goal and processor. Processor-loads of all partitions that are allocated to a
particular processor $P_j$ for a particular top-level goal $G_i$ must be combined to get
$pl_{i,j}(t)$. The “shaded-area” computation can now be done for each processor and
top-level goal combination and then these can be summed up according to equation
7.
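
The combination step described above can be sketched as follows for a single top-level goal. The sketch assumes piecewise-constant processor-loads given as (start, end, load) steps and sums the loads of all partitions placed on the same processor; the names and data are hypothetical, and the thesis's own algorithm (section 3.4) may organize this computation differently.

```python
from typing import Dict, List, Tuple

Steps = List[Tuple[float, float, float]]  # (start, end, load)


def combine_loads(loads: List[Steps]) -> Steps:
    """Sum several piecewise-constant load functions into one."""
    events = []
    for steps in loads:
        for start, end, load in steps:
            events.append((start, load))   # load begins
            events.append((end, -load))    # load ends
    events.sort()
    combined, current, prev_time = [], 0, None
    for time, delta in events:
        if prev_time is not None and time > prev_time and current > 0:
            combined.append((prev_time, time, current))
        current += delta
        prev_time = time
    return combined


def excess_area(steps: Steps) -> float:
    """Area of the load curve above the y = 1 line."""
    return sum((end - start) * max(0, load - 1) for start, end, load in steps)


def excess_per_processor(partition_loads: Dict[str, Steps],
                         allocation: Dict[str, int]) -> Dict[int, float]:
    """Group partition loads by processor, combine them, and measure the excess."""
    by_proc: Dict[int, List[Steps]] = {}
    for partition, steps in partition_loads.items():
        by_proc.setdefault(allocation[partition], []).append(steps)
    return {proc: excess_area(combine_loads(ls)) for proc, ls in by_proc.items()}


# Allocation-independent loads (each partition alone on a processor).
partition_loads = {"a": [(0, 4, 1)], "b": [(2, 6, 1)]}

print(excess_per_processor(partition_loads, {"a": 0, "b": 0}))  # {0: 2}
print(excess_per_processor(partition_loads, {"a": 0, "b": 1}))  # {0: 0, 1: 0}
```

Placing both partitions on one processor makes their loads overlap between times 2 and 4, which is exactly the region that contributes to the processor multiplexing cost.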

In  case  there  are  multiple  copies,  the  processor-load  associated  with  any  partic¬ 
ular  partition  is  assumed  to  be  equally  distributed  over  the  multiple  copies  of  the 
partition  in  question. 


3.2  Strategy  for  Computing  Cost  Function 

To  compute  the  cost  function  exactly  requires  doing  the  run-time  computation  at 
compile-time.  Since  this  is  clearly  senseless,  we  restrict  compile-time  computation 
to  reasoning  about  an  abstraction  of  the  run-time  computation,  an  abstraction  in 
which  specific  constants  of  the  facts  in  the  database  are  ignored.  The  hope,  of
course,  is  that  the  approximation  to  the  run-time  computation  is  close  enough  to 
get  meaningful  numbers  from  the  analysis.  In  addition,  it  is  hoped  that  much  less 
computation  needs  to  be  done  to  reason  with  the  abstraction  instead  of  the  real 
run-time  computation. 

Figure  24  illustrates  how  fact  patterns  with  unknown  constants  replace  facts  with 
actual  constants  in  the  database.  Symbols  beginning  with  “uc”  represent  unknown 
constants.  The  crossed  out  facts  are  the  ones  in  the  original  database.  Figure  25 
illustrates  that  using  fact  patterns  reduces  the  number  of  logical  inferences.  Logical 
inferences  enclosed  in  thick  ovals  may  be  collapsed  into  one  logical  inference  at 
compile-time.  In  the  best  case,  the  number  of  logical  inferences  may  be  reduced  by 
an exponential factor. Figure 26 shows a conjunctive goal with 3 conjuncts. If there
are $n$ facts with the a predicate, $n^2$ facts with the b predicate, and $n^3$ facts with
the c predicate, then the number of logical inferences at run-time is $(n + n^2 + n^3)$
or $O(n^3)$. In general, for m conjuncts, the number of logical inferences would be
$O(n^m)$. However, if the facts of each predicate get represented by a single compile-time
fact, then the number of logical inferences at compile-time is only 3. In general,
for m conjuncts, the number of compile-time inferences is only $O(m)$. Therefore,
the number of logical inferences is reduced by an exponential factor from $O(n^m)$ to
$O(m)$.
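
The size of the saving can be made concrete with a few lines of code. The sketch below reads the fact counts in the example as n, n^2, ..., n^m for the successive conjuncts, which is an interpretation of the text above; the function names and the sample values are illustrative only.

```python
def runtime_inferences(n: int, m: int) -> int:
    """n + n^2 + ... + n^m inferences when the k-th conjunct has n^k facts."""
    return sum(n ** k for k in range(1, m + 1))


def compile_time_inferences(m: int) -> int:
    """One inference per conjunct when each predicate has a single fact pattern."""
    return m


# Three conjuncts, as in the example, with a hypothetical n = 10.
print(runtime_inferences(10, 3))   # 1110
print(compile_time_inferences(3))  # 3
```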

One  effect  of  using  unknown  constants  is  that  unification  is  now  a  probabilistic 
process.  It  does  not  just  succeed  or  fail;  it  succeeds  with  some  probability.  This 
will  be  discussed  in  more  detail  in  section  3.3. 

Another  computation-saving  technique  used  in  the  cost  computation  procedures 
is  to  separate  out  the  allocation-independent  computation  from  the  allocation- 
dependent  computation.  The  allocation-independent  computation  needs  to  be  per¬ 
formed  only  once  for  the  application.  Only  the  allocation-dependent  computation 
needs  to  be  performed  when  a  specific  allocation  is  being  considered.  In  addition, 
when  an  allocation  is  changed  slightly,  even  the  allocation-specific  computation  need 
not  be  performed  from  scratch.  Useful  state  can  be  saved  between  recomputations 
and  this  can  lead  to  significant  savings. 

Figure 24: Compile-time Database (surviving figure text: r(X,Y), p(X), q(Y), s(X,Y); 2 s(uc1,uc2))

Figure 25: Compile-time Computation

Figure 26: Exponential Savings at Compile-time


In the case of communication cost computation, the number of messages and
the amount of data between each pair of partitions are independent of the allocation.
Only the distance between partitions depends on the allocation being
considered.

In  the  case  of  processor  multiplexing  cost  computation,  the  processor-load  func¬ 
tions  associated  with  each  partition  (assuming  that  they  are  allocated  to  separate 
processors)  are  independent  of  the  allocation.  Combining  processor-loads  of  differ¬ 
ent  partitions  does  depend  on  the  allocation.  However,  useful  state  can  be  kept 
between  recomputations  to  save  on  computational  effort  (as  will  be  seen  later  in 
section  3.4). 

An  alternative  to  this  approach  of  estimating  the  amounts  of  communication 
and  processor-loads  at  compile-time  is  to  gather  information  from  one  or  more  runs 
of  the  application  and  collect  this  information  for  use  by  the  cost  computation  pro¬ 
cedures.  One  can  also  think  of  hybrid  approaches  in  which  compile-time  estimates 
may  be  modified  (if  necessary)  by  data  collected  at  run-time.  The  advantage  with 
using  run-time  data  is  that  one  does  not  depend  on  assumptions  that  may  not  be 
totally  accurate  (required  to  do  compile-time  estimation  and  listed  in  section  3.1.2). 
However,  the  disadvantage  is  that  the  estimates  may  get  unduly  influenced  by  the 
last  run  or  last  several  runs.  Also,  making  several  runs  of  the  application  can  be 
much  more  expensive  (depending  on  the  number  of  runs)  than  making  one  run  using 
unknown  constants. 


3.3  Communication  Cost  Computation

The  algorithms  described  in  this  section  are  for  computing  the  communication  cost 
for  a  single  top-level  goal.  If  there  are  multiple  top-level  goals,  then  the  algorithms 
need  to  be  repeated  for  each  top-level  goal  and  the  costs  summed.  Also,  if  a  certain 
top-level  goal  is  repeated  multiple  times,  the  communication  cost  is  computed  for  a 
single  instance  and  the  communication  cost  for  the  multiple  instances  is  computed 
by  multiplying  the  single  instance  cost  by  the  repetition  factor. 
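
The aggregation over top-level goals described above amounts to a weighted sum. A minimal sketch, with hypothetical cost figures and names:

```python
from typing import Iterable, Tuple


def total_communication_cost(per_goal: Iterable[Tuple[float, int]]) -> float:
    """Sum of (single-instance cost x repetition factor) over top-level goals."""
    return sum(cost * repetitions for cost, repetitions in per_goal)


# Hypothetical single-instance costs and repetition factors for two goals.
print(total_communication_cost([(116.0, 3), (20.0, 1)]))  # 368.0
```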

The computation of communication cost is done by two algorithms. The first
algorithm is called the Communication Estimation algorithm. This algorithm performs
an abstract simulation of PM. A side-effect of the simulation is the estimation
of  the  amount  of  communication  (in  total  bytes  and  number  of  messages)  between 
every  pair  of  partitions.  The  second  algorithm  is  called  the  Communication  Cost 
Computation  algorithm.  This  algorithm  takes  the  output  of  the  Communication 
Estimation  algorithm  and  an  allocation  and  computes  the  communication  cost. 

The Communication Estimation algorithm is based on the idea of simulating
(at compile-time) a backward-chaining deduction using PM as the execution model.
The difference from the actual run-time computation is that the compile-time simulation
is less detailed and, therefore, takes less time than the run-time computation.
Probabilistic  analysis  replaces  some  of  the  detailed  computation  and  most  of  the 
description  of  the  algorithm  focuses  on  this  analysis. 

The organization of this section is as follows. Subsection 3.3.1 gives the
specifications of the communication estimation algorithm. The next three subsections
lay down the basis of the probabilistic analysis. Subsection 3.3.2 describes how
goals may be viewed as probabilistic filters over their solution domains. Subsection
3.3.3 describes how the probability of unification of two literals may be estimated
when the exact constants in the literals are not known at compile-time. Subsection
3.3.4 describes how run-time messages must be augmented to make them suitable
for the probabilistic analysis. The following subsections describe two variants of
the Communication Estimation algorithm. Subsection 3.3.5 describes the simpler
variant that does not deal with duplicate solutions while the next two subsections
show how duplicate solutions can be handled.

The  Communication  Cost  Computation  algorithm  is  trivial  compared  to  the 
Communication  Estimation  algorithm.  After  the  Communication  Estimation  algo¬ 
rithm  has  produced  an  estimate  of  the  amount  of  communication  between  every 
ordered  pair  of  partitions,  the  communication  cost  algorithm  simply  uses  this  in¬ 
formation  and  equations  5  and  6  to  compute  the  communication  cost.  Since  the 
algorithm  is  so  simple,  it  will  not  be  described  in  any  more  detail. 

Finally,  subsection  3.3.8  discusses  the  complexity  of  both  the  Communication 
Estimation  algorithms  and  the  Communication  Cost  Computation  algorithm. 

3.3.1  Specification  of  Communication  Estimation  Algorithm 

Inputs 

1.  F:  a  set  of  fact  patterns. 

2.  R:  a  set  of  rules. 

3. G: a set of goal patterns.

4. P: a set of subsets of R ∪ F that are mutually exclusive and exhaustive. Each
member  of  P  is  called  a  partition.  Remember  that  a  constraint  of  PM  is 
that  all  clauses  that  may  be  applied  to  reducing  any  particular  literal  subgoal 
generated  during  the  backward-chaining  deduction  should  be  included  in  pre¬ 
cisely  one  partition.  Note  that  we  are  talking  about  a  single  logical  inference 
here,  not  a  goal  reduction  involving  an  arbitrary  number  of  logical  inferences. 
As  an  example,  there  could  be  one  member  of  P  for  each  set  of  facts  and  rules 
with  a  different  predicate. 

5. domsize: a two-argument function that takes a predicate name and a number
specifying a field and returns the associated domain size. For example, if
parent(X,Y) indicates that X is a parent of Y, then domsize(parent, 1) = 2
since every person has two parents. Also, if the average number of children in
a family is 3, we might say that domsize(parent, 2) = 3.
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3.3  Communication  Cost  Computation 

The  algorithms  described  in  this  section  are  for  computing  the  communication  cost 
for  a  single  top-level  goal.  If  there  are  multiple  top-level  goals,  then  the  algorithms 
need  to  be  repeated  for  each  top-level  goal  and  the  costs  summed.  Also,  if  a  certain 
top-level  goal  is  repeated  multiple  times,  the  communication  cost  is  computed  for  a 
single  instance  and  the  communication  cost  for  the  multiple  instances  is  computed 
by  multiplying  the  single  instance  cost  by  the  repetition  factor. 

The  computation  of  communication  cost  is  done  by  two  algorithms.  The  first 
algorithm  is  called  the  Communication  Estimation  algorithm.  This  algorithm  per¬ 
forms  an  abstract  simulation  of  PM.  A  side-effect  of  the  simulation  is  the  estimation 
of  the  amount  of  communication  (in  total  bytes  and  number  of  messages)  between 
every  pair  of  partitions.  The  second  algorithm  is  called  the  Communication  Cost 
Computation  algorithm.  This  algorithm  takes  the  output  of  the  Communication 
Estimation  algorithm  and  an  allocation  and  computes  the  communication  cost. 

The  Communication  Estimation  algorithm  is  based  on  the  idea  of  simulating 
(at  compile-time)  a  backward-chaining  deduction  using  PM  as  the  execution  model. 
The  difference  from  the  actual  run-time  computation  is  that  the  compile-time  simu¬ 
lation  is  less  detailed  and,  therefore,  takes  less  time  than  the  run-time  computation. 
Probabilistic  analysis  replaces  some  of  the  detailed  computation  and  most  of  the 
description  of  the  algorithm  focuses  on  this  analysis. 

The organization of this section is as follows. Subsection 3.3.1 gives the
specifications of the communication estimation algorithm. The next three
subsections lay down the basis of the probabilistic analysis. Subsection 3.3.2
describes how goals may be viewed as probabilistic filters over their solution
domains. Subsection 3.3.3 describes how the probability of unification of two
literals may be estimated when the exact constants in the literals are not
known at compile-time. Subsection 3.3.4 describes how run-time messages must be
augmented to make them suitable for the probabilistic analysis. The next two
subsections describe two variants of the Communication Estimation algorithm.
Subsection 3.3.5 describes the simpler variant that does not deal with
duplicate solutions while the next two subsections

Output

•  C: a function that takes two partitions P1 and P2 and returns a tuple of
the form < data, number > where data is the amount of data (in bytes) and
number is the number of messages sent from partition P1 to partition P2. Both
data and number are expected values in a probabilistic sense.


3.3.2  Goals  as  Filters 

Each  goal,  be  it  a  literal  or  a  conjunction  of  literals,  can  be  characterized  as  a  filter 
over  its  solution  domain.  Filter  probability  is  defined  to  be  the  probability  that  a 
random  member  of  the  set  of  possible  solutions  is  a  member  of  the  set  of  actual 
solutions. 

The  cardinality  of  the  domain  of  possible  solutions  (of  a  literal  goal  or  a  con¬ 
junctive  goal),  Np,  is  given  by  the  following  equation: 

$$N_p = \prod_{v_i \in V} d(v_i) \qquad (8)$$

where V is the set of variables in the goal and d(X) is the size of the domain of
variable  X.  This  formula  assumes  that  if  the  same  variable  occurs  more  than  once 
in  a  single  conjunct  or  in  more  than  one  conjunct  in  a  conjunctive  goal,  then  its 
domain  is  the  same  for  each  occurrence. 

If the number of actual solutions is N_a, then the filter probability, FP, is
given by

$$FP = \frac{N_a}{N_p} \qquad (9)$$

By  using  the  literal  independence  assumption,  it  follows  directly  that  the  filter  of 
a  conjunctive  goal  is  equal  to  the  product  of  the  filters  of  the  individual  conjuncts. 
In  other  words,  the  expected  number  of  solutions  N  to  a  conjunctive  goal 

$$C = \{C_1, C_2, \ldots, C_n\}$$


is  given  by 
$$N = N_p \times \prod_{i=1}^{n} FP(C_i) \qquad (10)$$

where N_p is the number of possible solutions and FP(C_i) is the filter
probability of conjunct C_i. Plugging in the value of N_p from equation 8, we
get

$$N = \prod_{v_i \in V} d(v_i) \times \prod_{i=1}^{n} FP(C_i) \qquad (11)$$

where V and d are as defined before. This is an important equation because it
makes the Communication Estimation algorithm particularly simple, as will be
seen later.

As  an  example  of  the  application  of  this  equation,  see  figure  27.  A  3-conjunct 
goal  has  to  be  solved  with  the  database  and  domain  sizes  as  given.  In  this  example, 

Np  =  d(X)  x  d{Y)  x  d(Z)  =  12  x  4  x  2  =  96 

Also, the filter probabilities of the three conjuncts can be computed as
follows. The filter probability of “a(X)” is its actual number of solutions
(= 6) divided by its potential number of solutions (= d(X) = 12), which is 0.5.
The filter probability of “b(X,Y)” is its actual number of solutions (= 24)
divided by its potential number of solutions (= d(X) x d(Y) = 12 x 4 = 48),
which is 0.5. The filter probability of “c(X,Z)” is its actual number of
solutions (= 8) divided by its potential number of solutions
(= d(X) x d(Z) = 12 x 2 = 24), which is 1/3 (approximately 0.33). Therefore,
using equation 11, we get

$$N = \prod_{v_i \in V} d(v_i) \times \prod_{i=1}^{3} FP(C_i)
    = 96 \times \prod_{i=1}^{3} FP(C_i)
    = 96 \times 0.5 \times 0.5 \times \tfrac{1}{3} = 8$$
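
The arithmetic of this example can be checked with a short Python sketch of
equations 8, 9, and 11. The function names and the representation of a conjunct
as a pair (number of actual solutions, variables) are assumptions made here for
illustration; they are not part of PM. Exact fractions are used so that the
filter probability 1/3 is not rounded.

```python
from fractions import Fraction
from math import prod


def possible_solutions(domain_sizes):
    """Equation 8: product of the domain sizes of all variables in the goal."""
    return prod(domain_sizes.values())


def filter_probability(actual, variables, domain_sizes):
    """Equation 9 for one conjunct: actual solutions / possible solutions."""
    possible = prod(domain_sizes[v] for v in variables)
    return Fraction(actual, possible)


def expected_solutions(conjuncts, domain_sizes):
    """Equation 11: N = prod d(v_i) * prod FP(C_i)."""
    n_p = possible_solutions(domain_sizes)
    fps = [filter_probability(actual, variables, domain_sizes)
           for actual, variables in conjuncts]
    return n_p * prod(fps)


# Numbers taken from figure 27.
domain_sizes = {"X": 12, "Y": 4, "Z": 2}
conjuncts = [
    (6, ("X",)),        # a(X): 6 facts
    (24, ("X", "Y")),   # b(X,Y): 24 facts
    (8, ("X", "Z")),    # c(X,Z): 8 facts
]
print(expected_solutions(conjuncts, domain_sizes))  # 8
```

The sketch assumes that `domain_sizes` lists exactly the variables of the
conjunctive goal, so that equation 8 ranges over the right set V.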

Database            Domain sizes

 6  a(uc1).         d(X) = 12
24  b(uc2,uc3).     d(Y) = 4
 8  c(uc4,uc5).     d(Z) = 2

Figure 27: Predicting Communication

Now, we carry this analysis a step further. Each conjunct in a conjunctive
goal may actually be reduced by more than one rule or by more than one fact.
Therefore, more than one path of reasoning may lead to actual solutions for the
conjunct. Each such path of reasoning, or a set of such paths of reasoning
considered together, may lead to a particular set of actual solutions. This
particular set of actual solutions will be a subset of the complete set of
actual solutions but can be characterized as a filter nonetheless. Of course,
the filter probability associated with a conjunct, considering only a subset of
the paths of reasoning, will be less than or equal to the filter probability
associated with the conjunct when all the paths of reasoning are considered.
Due to the literal independence assumption, the filter probability associated
with a conjunctive goal will be the product of the filter probabilities
associated with the individual conjuncts for the same subset of the actual
solutions.


3.3.3  Probability  of  Unification 

This  section  describes  how  one  can  compute  the  probability  of  unification  of  two 
literals.  Each  literal  can  contain  unknown  constants.  During  unification,  variables, 
constants  and  unknown  constants  may  be  unified  against  each  other.  A  valid  unify¬ 
ing  substitution  may  contain  bindings  of:  (1)  variables  to  either  variables,  unknown 
constants,  or  constants,  and/or  (2)  unknown  constants  to  either  unknown  constants 
or  constants.  Of  course,  constants  to  be  unified  must  match  exactly. 

Table 1 gives the probabilities of these unifications. In the table, d(uc_i),
where uc_i is an unknown constant, refers to the domain size (given by the
function domsize) of the field of the relation that uc_i is associated with.
The probability of unification of the two literals is simply the product of the
probabilities of unifications of the type given in the table. Taking a product
is justified by the argument independence assumption. Similar probabilistic
analysis has been used before by Treitel in his work on selecting the optimal
mix of forward and backward inference for a sequential processor [69].

As an example, consider the unification of the two literals a(uc1, uc2, uc3)
and a(X, X, c1). In this case, uc1 and uc2 must be unifiable and the
probability of this can be found from table 1 to be
$\frac{1}{d(uc_1)} = \frac{1}{d(uc_2)}$. Notice that the domain sizes of the
two unknown constants have been assumed to be equal. Also, uc3 must be
unifiable with c1 and the probability of this can be found from the table to be
$\frac{1}{d(uc_3)}$. The probability of unification of the two literals is the
product of these two probabilities.

[Table 1: Probabilities of Unification. Row headings: V1, c1, uc1. Column
headings: V2, c2, uc2.]
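
The product rule can be sketched in Python as follows. The cell values used
here are assumptions inferred from the surrounding text rather than copied from
table 1: a variable unifies with anything with probability 1, two known
constants unify only if they are identical, and an unknown constant unifies
with a constant or with another unknown constant with probability 1/d. Terms
are encoded as (kind, name) pairs, which is a hypothetical encoding chosen for
this sketch, and the two literals are assumed to have been renamed apart.

```python
from fractions import Fraction


def deref(term, bindings):
    """Follow variable bindings until a non-variable or an unbound variable."""
    while term[0] == "var" and term[1] in bindings:
        term = bindings[term[1]]
    return term


def unify_probability(args1, args2, d):
    """Probability that two argument lists unify (argument independence).

    args1, args2: lists of ("var" | "const" | "uc", name) pairs.
    d: domain size for each unknown constant name.
    """
    bindings = {}
    p = Fraction(1)
    for t1, t2 in zip(args1, args2):
        t1, t2 = deref(t1, bindings), deref(t2, bindings)
        if t1 == t2:
            continue                       # identical terms always unify
        if t1[0] == "var":
            bindings[t1[1]] = t2           # variable: probability 1
        elif t2[0] == "var":
            bindings[t2[1]] = t1
        elif t1[0] == "const" and t2[0] == "const":
            return Fraction(0)             # different known constants
        else:
            uc = t1 if t1[0] == "uc" else t2
            p *= Fraction(1, d[uc[1]])     # unknown constant involved: 1/d
    return p


# a(uc1, uc2, uc3) against a(X, X, c1); the domain sizes are hypothetical.
d = {"uc1": 2, "uc2": 2, "uc3": 2}
lit1 = [("uc", "uc1"), ("uc", "uc2"), ("uc", "uc3")]
lit2 = [("var", "X"), ("var", "X"), ("const", "c1")]
p = unify_probability(lit1, lit2, d)
print(p)  # 1/4
```

With every domain size set to 2, the repeated variable X forces uc1 and uc2 to
match (probability 1/2) and uc3 must match c1 (probability 1/2), which gives
the product 1/4.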

As another example, consider figure 27 again. Recall that in chapter 2 we
proved a theorem stating that the set of solutions produced by PM is equal to
the set of solutions produced by a Prolog interpreter. Of course, this theorem
also implies that the cardinalities of the sets of solutions must be equal in
the two cases. In section 3.3.2, we saw that applying equation 11 gave the
number of solutions of the conjunctive goal in figure 27 to be 8. Now, we can
get the number of solutions by using a total order of conjuncts as in a Prolog
interpreter and using table 1 directly and show that we get the same number of
solutions. In particular, a Prolog interpreter might use a total order like the
one shown in figure 28. The number of solutions of the first conjunct will be 6
because there are 6 “a” facts in the database and the variable X in the goal
unifies with probability 1 with uc1 in the facts. Next, the variable X in the
second conjunct gets bound to an unknown constant. The probability of
unification of the “b” conjunct with the “b” facts in the database is the
inverse of the domain size of the first field of the “b” relation
(= d(X) = 12 in figure 28) since an unknown constant is getting unified with
another unknown constant. Since there are 6 “b” goals, 24 “b” facts in the
database, and the probability of unification is 1/12, the expected number of
solutions of the first two conjuncts is 6 x 24 x 1/12 = 12. Next, variable X in
the “c” conjunct gets bound to an unknown constant. The probability of
unification of the “c” goals with the “c” facts is the inverse of the domain
size of the first field of the “c” relation (= d(X) = 12) since an unknown
constant must be unified with another unknown constant. Since there are 12 “c”
goals, 8 “c” facts in the database, and the probability of unification is 1/12,
the expected number of solutions of all three conjuncts together is
12 x 8 x 1/12 = 8. This is the same as the number obtained by applying equation
11. This technique of finding the expected number of solutions of a set of
conjuncts by mapping it back repeatedly to the Prolog case can lead to very
inelegant and inefficient algorithms. Using equation 11 directly turns out to
be much simpler (as will be seen later in section 3.3.5).

Total order of conjuncts: a(X), b(X,Y), c(X,Z)

Database            Domain sizes

 6  a(uc1).         d(X) = 12
24  b(uc2,uc3).     d(Y) = 4
 8  c(uc4,uc5).     d(Z) = 2

Figure 28: Estimating Number of Solutions for Prolog
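
The Prolog-order calculation above can be reproduced with a few lines of
Python. This is only a sketch of the arithmetic in the text: the function name
and its two list arguments (fact counts and per-conjunct unification
probabilities, both in the total order a, b, c) are assumptions made for
illustration.

```python
from fractions import Fraction


def prolog_order_estimate(fact_counts, unify_probs):
    """Expected number of solutions after each conjunct in a total order.

    At each step: goals reaching the conjunct x matching facts x probability
    that one goal unifies with one fact.
    """
    goals = Fraction(1)          # a single top-level goal enters the first conjunct
    running = []
    for facts, p in zip(fact_counts, unify_probs):
        goals = goals * facts * p
        running.append(goals)
    return running


# Figure 28: 6 "a" facts, 24 "b" facts, 8 "c" facts, d(X) = 12.
running = prolog_order_estimate(
    [6, 24, 8],
    [Fraction(1), Fraction(1, 12), Fraction(1, 12)],
)
print([int(n) for n in running])  # [6, 12, 8]
```

The final value agrees with the result of equation 11 for the same goal, which
is the point of the comparison.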

3.3.4  Compile-time  Messages

As  mentioned  before,  the  communication  estimation  algorithm  is  based  on  the  idea 
of  simulating  (at  compile-time)  a  backward-chaining  deduction  using  PM  as  the  ex¬ 
ecution  model.  The  difference  from  run-time  deduction  is  that  unknown  constants 
may  be  used  at  compile-time.  For  now,  assume  that  a  message  in  PM  contains 
substitutions  only.  The  initial  behavioral  description  of  PM  made  the  same  simpli¬ 
fication.  However,  at  compile-time,  a  message  contains  some  additional  information. 

First, a substitution with unknown constants represents an equivalence class of
actual substitutions with actual constants only. Each member of the equivalence
class is obtained by giving each unknown constant a value in its domain. A
compile-time message is associated with a number called the number of
substitutions. This number represents how many instances of the compile-time
message's equivalence class are expected (in the probabilistic sense) to be
sent on the associated channel at run-time. Note that at run-time each message
is treated completely independently. For example, when a new conjunct graph is
created to solve a subgoal at run-time, a single message is sent to the head
process of the new conjunct graph along its task channel. At compile-time, if
the same message has an associated number of substitutions of N, then N
separate conjunct graphs would actually be generated at run-time, each with its
own message on the task channel to the head process.

The  advantage  of  using  the  unknown  constant  abstraction  is  that  it  allows  the 
algorithm  to  estimate  communication  cost  without  doing  the  entire  deduction  itself. 
Also,  if  the  goal  for  the  entire  deduction  is  specified  using  unknown  constants,  the 
expected  communication  costs  for  the  entire  class  of  goals  represented  is  computed 
in  one  pass.  In  contrast,  the  run-time  execution  model  can  only  handle  one  specific 
goal  at  a  time. 

Second, a compile-time message is associated with a filter set. A filter set is
a set of 2-tuples. There is one such 2-tuple for each literal that has been
processed so far in the conjunct graph. Each tuple contains: (1) a number
indicating the position (leftmost being 1) of the literal in the antecedents of
the rule that generated the associated conjunct graph and (2) the filter
probability for that literal that led to the set of substitutions described by
the compile-time message.

In addition, each message on each channel in a conjunct graph contains the
initial number of substitutions for the conjunct graph. The initial number of
substitutions is the number of substitutions sent on the task channel of the
head process to the complete conjunct graph. A complete conjunct graph is
defined to mean a two-terminal DAG of processes that includes a matching pair
of Head and Tail processes and all the normal processes in between.

Considering  everything,  a  compile-time  message  is  a  4-tuple: 

<N,S,NI,FS> 

where N is the number of substitutions, S is the substitution, NI is the
initial number of substitutions and FS is the filter set.1

Each  compile-time  message  is  associated  with  a  particular  channel  in  the  dataflow* 
graph  during  simulated  deduction.  The  source  node  and  the  destination  node  of 
the  channel  are  associated  with  one  database  partition  each.  The  compile-time 
message  contributes  to  the  amount  of  communication  between  this  pair  of  database 
partitions.  The  amount  of  data  is 

data(S)  x  N 

where data(S) is the amount of data (in bytes, say) that will be contained in
the substitution at run-time that S represents. (S itself may contain unknown
constants, each of which represents a known atomic constant at run-time.) The
number of messages to be sent is N, the number of substitutions. The total
amount of communication between a pair of partitions is the sum of
contributions from each message.
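
As an illustration, a compile-time message and its contribution to the
communication between a pair of partitions might be represented as in the
following Python sketch. The class and function names are hypothetical, and so
is the size model: data(S) is taken to be a fixed number of bytes per binding,
which the text does not specify.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class CompileTimeMessage:
    n: float                      # N: expected number of substitutions
    s: tuple                      # S: substitution as (variable, term) pairs
    ni: float                     # NI: initial number of substitutions
    fs: frozenset = frozenset()   # FS: filter tuples (position, probability)


BYTES_PER_BINDING = 8  # hypothetical size model, not taken from the text


def data_size(substitution):
    """Assumed stand-in for data(S): a fixed cost per binding."""
    return BYTES_PER_BINDING * len(substitution)


def add_contribution(comm, src, dst, msg):
    """Add data(S) x N bytes and N messages to the (src, dst) entry."""
    data, number = comm[(src, dst)]
    comm[(src, dst)] = (data + data_size(msg.s) * msg.n, number + msg.n)


# comm plays the role of the output function C: (P1, P2) -> (data, number).
comm = defaultdict(lambda: (0, 0))
msg = CompileTimeMessage(
    n=4,
    s=(("X", "uc1"), ("X1", "uc3"), ("Y", "uc4")),
    ni=1,
    fs=frozenset({(1, 1), (2, 1)}),
)
add_contribution(comm, "P1", "P2", msg)
print(comm[("P1", "P2")])  # (96, 4)
```

Summing such contributions over every compile-time message generated during
the simulated deduction gives the expected data volume and message count for
each ordered pair of partitions.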

Section  3.3.5  describes  the  algorithm  to  compute  the  amount  of  communication 
between  each  pair  of  partitions  for  a  single  goal.  If  multiple  goals  are  given,  the 
algorithm  must  be  repeated  for  each  goal  and  the  amounts  of  communication  added 
up.  This  algorithm  follows  quite  naturally  from  the  ideas  in  sections  3.3.2,  3.3.3, 
and  3.3.4.  It  may  be  skipped  without  loss  of  continuity  in  the  thesis.  The  interested 
reader  can  return  to  this  section  later  for  more  detail. 


1  A message at run-time is a substitution at this level of detail.




3.3.5  Communication  Estimation  Algorithm  (No  Duplicates) 

This  section  presents  the  behavioral  description  of  the  simulated  parallel  execution 
model.  The  description  is  similar  in  nature  to  the  behavioral  description  of  PM 
contained  in  section  2.3.2.  The  only  difference  is  that  the  behavioral  description 
of  PM  dealt  with  actual  messages  whereas  simulated  PM  deals  with  compile-time 
messages.  As  explained  before,  the  set  of  compile-time  messages  that  is  generated 
during  simulated  deduction  contains  sufficient  information  to  compute  the  amount 
of  expected  communication  between  each  pair  of  partitions. 

The description is divided into four parts as before: (1) Sim-CP, the analog
of the CP function of PM, (2) the response of a normal process to each
compile-time message on its virtual input channel, (3) the response of a tail
process to each compile-time message on its virtual input channel, and (4) the
response of a normal process to each compile-time message on each of its
subsolution channels.

A running example to make the explanations clearer is shown in figure 29. This
is the same example as the one that was considered in chapter 2 except that
unknown constants have been used for the facts. Also, the NF numbers to the
right of the facts indicate how many facts of that pattern are present in the
database. The dataflow* graph for the simulated deduction is shown in figure
30. Each compile-time message on each channel is shown in the figure. Only one
compile-time message is sent on each channel for this example. This is not true
for all cases. We will assume for this example that the cardinality of the
domain of each variable and unknown constant is 2.

3.3.5.1  Analog of the CP function

As  described  in  chapter  2,  the  function  CP  takes  n  sets  of  substitutions — a  set  for 
each  input  channel  of  a  normal  process — and  returns  a  single  set  of  substitutions. 
The  output  set  of  substitutions  is  the  one  carried  on  the  hypothetical  virtual  input 
channel  of  the  normal  process  in  question.  CP  considers  the  cartesian  product  of 
the  input  sets  and  rejects  all  inconsistent  composite  substitutions  using  an  auxiliary 
function  called  Merge. 

r(X,Y) :- p(X), q(Y), s(X,Y)
p(uc1)                          NF=2
q(Y) :- m(X), n(X,Y)
m(uc2)                          NF=2
n(uc3,uc4)                      NF=2
s(uc5,uc6)                      NF=2

r(X,Y) is the top level goal

Figure 29: Example Database

Sim-CP, the simulated deduction version of CP, considers the cartesian product
of sets of compile-time messages and rejects composite messages that contain
inconsistent substitutions. The point of departure from CP is Sim-Merge, the
simulated deduction version of Merge. Sim-Merge must determine consistency of
substitutions as before. However, in addition to that, it must compute the
other fields of a compile-time message. In particular, these fields are number
of substitutions, initial number of substitutions and filter set.

Let there be n input channels for the normal process in question. Sim-Merge
takes one compile-time message from each channel and returns either a
compile-time message or the special element ⊥. The special element is used to
indicate inconsistent input substitutions, just as in Merge. Let the
compile-time message on the ith channel be

< N_i, S_i, NI, FS_i >

In case the input compile-time messages contain inconsistent substitutions, the
output of Sim-Merge is ⊥. Substitutions are inconsistent if the same variable
is bound to different known constants. Variables bound to different unknown
constants are not inconsistent.

If the output is not ⊥, it is a compile-time message

< N_o, S_o, NI, FS_o >

Since a filter set contains filter tuples for each ancestor conjunct, we have

$$FS_o = \bigcup_{i=1}^{n} FS_i \qquad (12)$$

Union  removes  duplicate  filter  tuples.  Duplicates  can  arise  because  the  same  con¬ 
junct  may  be  an  ancestor  from  more  than  one  input  channel. 

Since  we  are  presumably  considering  one  conjunct  graph  (that  may  be  instanti¬ 
ated  a  number  of  times  at  run-time),  all  messages  in  that  conjunct  graph  have  the 
same  initial  number  of  substitutions. 

Equation 11 (in section 3.3.2) showed how the expected number of solutions for
a set of conjuncts could be computed. Notice that each compile-time message in
the set returned by Sim-CP is a solution of the set of conjuncts associated
with the ancestor processes of the normal process being considered. Therefore,
for each initial substitution for the conjunct graph, the expected number of
solutions of the ancestor conjuncts, N_int, that will be generated by Sim-Merge
is given by the equation:

$$N_{int} = \prod_{v_i \in V} d(v_i) \times \prod_{FP_i \in FPS_o} FP_i \qquad (13)$$

where V is the union of the sets of bound variables in the substitutions of the
input messages and FPS_o is the set of filter probabilities contained in the
filter tuples belonging to FS_o. For example, if

FS_o = {< 1, .5 >, < 2, .75 >}

where < 1, .5 > indicates that the filter probability of the first literal is
.5, then

FPS_o = {.5, .75}

Moreover, the total expected number of solutions, N_o, will be given by the
equation:

$$N_o = NI \times N_{int} \qquad (14)$$

Plugging equation 13 into equation 14, we get

$$N_o = NI \times \prod_{v_i \in V} d(v_i) \times \prod_{FP_i \in FPS_o} FP_i \qquad (15)$$

Figure 30 shows this computation for the “s(X,Y)” box. The input compile-time
messages are

< 2, {X = uc1}, 1, {< 1, 1 >} >

and

< 2, {X1 = uc3, Y = uc4}, 1, {< 2, 1 >} >

Equation 12 is used to compute FS_o.

$$FS_o = \bigcup_{i=1}^{2} FS_i = \{\langle 1, 1 \rangle, \langle 2, 1 \rangle\}$$

Equation 15 is used to compute N_o.

$$N_o = NI \times \prod_{v_i \in V} d(v_i) \times \prod_{FP_i \in FPS_o} FP_i
      = 1 \times d(X) \times d(Y) \times 1 \times 1 = 2 \times 2 = 4$$

Since the substitutions are consistent, S_o is obtained by taking the union of
the two input substitutions.

S_o = {X = uc1, X1 = uc3, Y = uc4}
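
A minimal Python sketch of Sim-Merge, under stated assumptions, is given below.
Substitutions are modelled as dictionaries from variable names to term names,
unknown constants are recognised by the hypothetical naming convention
"uc<k>", and None stands for the special element ⊥. The set V is passed in
explicitly because the worked example multiplies d(X) and d(Y) only.

```python
from math import prod
from typing import NamedTuple, Optional


class Msg(NamedTuple):
    n: float        # N: number of substitutions
    s: dict         # S: substitution, variable name -> term name
    ni: float       # NI: initial number of substitutions
    fs: frozenset   # FS: filter tuples (position, filter probability)


def is_unknown(term):
    """Hypothetical convention: unknown constants are named 'uc<k>'."""
    return term.startswith("uc")


def sim_merge(msgs, v, d) -> Optional[Msg]:
    """Combine one message per input channel; None plays the role of ⊥."""
    merged = {}
    for m in msgs:
        for var, term in m.s.items():
            old = merged.get(var)
            if old is None or is_unknown(old):
                merged[var] = term      # unknown constants never clash
            elif not is_unknown(term) and term != old:
                return None             # same variable, different known constants
    fs = frozenset().union(*(m.fs for m in msgs))             # equation 12
    ni = msgs[0].ni                                           # shared by all inputs
    n_int = prod(d[x] for x in v) * prod(fp for _, fp in fs)  # equation 13
    return Msg(ni * n_int, merged, ni, fs)                    # equation 15


# The "s(X,Y)" box of figure 30.
inputs = [
    Msg(2, {"X": "uc1"}, 1, frozenset({(1, 1)})),
    Msg(2, {"X1": "uc3", "Y": "uc4"}, 1, frozenset({(2, 1)})),
]
out = sim_merge(inputs, v=["X", "Y"], d={"X": 2, "Y": 2})
print(out.n, sorted(out.fs))  # 4 [(1, 1), (2, 1)]
```

The product over filter probabilities is taken over the filter tuples rather
than over a deduplicated set of values, so two literals with equal filter
probabilities both contribute.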

3.3.5.2  Response of Normal Process to Compile-time Messages on Virtual
Input Channel

In  real  PM,  each  message  on  a  virtual  input  channel  contains  a  substitution  and 
this  substitution  applied  to  the  literal  associated  with  the  normal  process  represents 
a  goal  to  solve.  Rules  and  facts  associated  with  the  normal  process  are  applied  to 
the  goal  in  an  attempt  to  reduce  or  solve  it.  The  rest  of  this  section  describes  the 
behavior  of  the  process  for  one  of  these  applicable  rules/facts.  The  same  behavior 
is  repeated  for  each  rule/fact. 

Let the compile-time message on the virtual input channel be

< N_i, S_i, NI_i, FS_i >

and the compile-time message on the subtask channel be

< N_o, S_o, NI_o, FS_o >

The total number of actual messages represented by the input compile-time
message is N_i. Some of the associated goals will unify with the literal
representing the head of the rule or the fact. The probability of unification
can be computed as shown in section 3.3.3. Let PU be this probability. The
number of successful unifications, N_o, in case a rule is used is given by the
equation:

$$N_o = N_i \times PU \qquad (16)$$

In case we are dealing with a fact (as opposed to a rule) and the number
associated with the fact is NF, then the number of successful unifications,
N_o, is given by the equation:

$$N_o = N_i \times PU \times NF \qquad (17)$$

Also, NI_o will be equal to N_o. S_o = {} and FS_o = {} because no conjunct in
the new conjunct graph will have been solved as yet. The invocation
substitution is associated with the tail process of the conjunct graph as
described in chapter 2.

The  head  process  of  the  new  conjunct  graph  passes  this  message  unchanged  to 
each  of  its  output  channels. 

As an example, look in figure 30 at the response of the “s(X,Y)” process to the
compile-time message on its virtual input channel. This message is

< 4, {X = uc1, X1 = uc3, Y = uc4}, 1, {< 1, 1 >, < 2, 1 >} >

Since the domain of each variable X and Y is 2, the probability of unification,
PU, of the goal “s(uc1,uc4)” with the fact “s(uc5,uc6)” is

$$PU = \tfrac{1}{2} \times \tfrac{1}{2} = 0.25$$

Therefore, using equation 17

$$N_o = N_i \times PU \times NF = 4 \times 0.25 \times 2 = 2$$

NI_o is equal to N_o.

S_o = {}

FS_o = {}
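
Equations 16 and 17 and the construction of the subtask message can be sketched
in Python as follows. The function names and the tuple layout (N, S, NI, FS)
are assumptions for illustration; passing `nf=None` is a convention chosen here
to distinguish a rule from a fact pattern.

```python
def successful_unifications(n_i, pu, nf=None):
    """Equation 16 for a rule (nf is None), equation 17 for a fact pattern."""
    return n_i * pu if nf is None else n_i * pu * nf


def subtask_message(n_i, pu, nf=None):
    """Message for the new conjunct graph: NI_o = N_o, S_o = {}, FS_o = {}."""
    n_o = successful_unifications(n_i, pu, nf)
    return (n_o, {}, n_o, frozenset())


# The "s(X,Y)" process of figure 30: N_i = 4, PU = 0.25, NF = 2.
print(subtask_message(4, 0.25, nf=2))  # (2.0, {}, 2.0, frozenset())
```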

3.3.5.3  Response of Tail Process to Compile-time Messages on Virtual
Input Channel

Let the compile-time message on the input channel be

< N_i, S_i, NI_i, FS_i >

and the compile-time message on the task channel be

< N_o, S_o, NI_o, FS_o >

Since the tail process does not solve any goal as such, N_o = N_i,
NI_o = NI_i, and FS_o = FS_i. The only difference is that

S_o = Composition(IS, S_i)

where IS is the invocation substitution.2

As an example, look in figure 30 at the response of the tail process below
“n(X1,Y)” to the message on its input channel. The message is

<2,{},2,{}> 

Notice  that,  in  this  case,  since  there  is  only  one  input  to  the  tail  process,  the  virtual 
input  channel  is  the  same  as  the  input  channel.  The  invocation  substitution  is 
{Y  =  uc4}.  Therefore, 

S0  =  Composition({Y  =  uc4},{})  =  {Y  =  uc4} 

The  rest  of  the  components  of  the  message  on  the  solution  channel  are  the  same  as 
the  ones  for  the  message  on  the  input  channel. 

3.3.5.4  Response of Normal Process to Compile-time Messages on
Subsolution Channels

In this case, the computation depends on the compile-time message (on the
virtual input channel to the normal process) that is associated with the
solution being reported (on the subsolution channel). Let this compile-time
message be


2  See  chapter  2  for  a  description  of  invocation  substitution. 

< N_1, S_1, NI_1, FS_1 >

Also, let the compile-time message on the subsolution channel be

< N_2, S_2, NI_2, FS_2 >

and the compile-time message on the output channel from the normal process be

< N_3, S_3, NI_3, FS_3 >

First, since the message on the output channel is associated with the same
conjunct graph, we have

$$NI_3 = NI_1 \qquad (18)$$

Second, just as in real PM,

$$S_3 = Composition(S_1, S_2) \qquad (19)$$

Since the total number of messages on the subsolution channel must be the same
as the total number of messages on the output channel of the normal process,

$$N_3 = N_2 \qquad (20)$$

For each real message on the virtual input channel, the cardinality of the
domain of possible solutions is

$$\prod_{v_i \in V} d(v_i)$$

where V is the set of variables in the goal literal (i.e., the literal obtained
by instantiating the literal associated with the normal process with the
substitution on the virtual input channel). Therefore, the number of possible
solutions for N_1 messages is

$$N_1 \times \prod_{v_i \in V} d(v_i)$$

However, the actual number of solutions obtained is N_3. Therefore, the filter
probability (FP) associated with this set of solutions is given by the equation

$$FP = \frac{N_3}{N_1 \times \prod_{v_i \in V} d(v_i)} \qquad (21)$$

The filter tuple (FT) associated with this is < n, FP >, where n is the
position of the literal associated with the process in the antecedents of the
rule that generated the conjunct graph. Therefore,

$$FS_3 = FS_1 \cup \{FT\}$$


As an example, look in figure 30 at process “n(X1,Y)”. In this case,

< N_1, S_1, NI_1, FS_1 > = < 2, {X1 = uc2}, 1, {< 1, 1 >} >

< N_2, S_2, NI_2, FS_2 > = < 2, {Y = uc4}, 2, {} >

Therefore, from equation 19, we have

S_3 = Composition({X1 = uc2}, {Y = uc4}) = {X1 = uc2, Y = uc4}

From equation 18, we have

NI_3 = NI_1 = 1

Also, equation 20 gives us

N_3 = N_2 = 2

The filter probability of the conjunct “n(X,Y)” for this compile-time message
is given by equation 21

$$FP = \frac{N_3}{N_1 \times \prod_{v_i \in V} d(v_i)} = \frac{2}{2 \times 2} = 0.5$$

because the only variable in the goal is “Y” and its domain is 2. Therefore,

FS_3 = {< 1, 1 >, < 2, 0.5 >}
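
Equations 18 to 21 can be collected into one Python sketch. Messages are plain
tuples (N, S, NI, FS) and the names are hypothetical. The `compose` helper is a
simplified stand-in for Composition that merely merges bindings over disjoint
variables, which is enough for this example but is not a general substitution
composition.

```python
from math import prod


def compose(s1, s2):
    """Simplified stand-in for Composition: merge disjoint bindings."""
    return {**s1, **s2}


def subsolution_response(m1, m2, position, goal_vars, d):
    """Output message of a normal process for one subsolution message.

    m1: message on the virtual input channel, m2: message on the subsolution
    channel, position: index of the literal in the rule antecedents,
    goal_vars: variables left in the instantiated goal literal.
    """
    n1, s1, ni1, fs1 = m1
    n2, s2, _ni2, _fs2 = m2
    n3 = n2                                          # equation 20
    s3 = compose(s1, s2)                             # equation 19
    ni3 = ni1                                        # equation 18
    fp = n3 / (n1 * prod(d[v] for v in goal_vars))   # equation 21
    fs3 = fs1 | {(position, fp)}                     # FS_3 = FS_1 U {FT}
    return (n3, s3, ni3, fs3), fp


# Process "n(X1,Y)" of figure 30; "Y" is the only variable left in the goal.
m1 = (2, {"X1": "uc2"}, 1, frozenset({(1, 1)}))
m2 = (2, {"Y": "uc4"}, 2, frozenset())
out, fp = subsolution_response(m1, m2, position=2, goal_vars=["Y"], d={"Y": 2})
print(fp, sorted(out[3]))  # 0.5 [(1, 1), (2, 0.5)]
```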

3.3.6  Strategy  for  Dealing  with  Duplicate  Solutions

Some  rules  can  generate  duplicate  solutions  to  a  goal.  This  can  happen  if  a  variable 
occurs  in  the  tail  of  the  rule  but  not  in  its  head.  For  example,  consider  the  rule 

h(X,Z) :- t1(X,Y), t2(Y,Z)

If the subgoal t1(X,Y) produced the two solutions {X = 3, Y = 5} and
{X = 3, Y = 6} and the subgoal t2(Y,Z) produced the two solutions
{Y = 5, Z = 8} and {Y = 6, Z = 8}, then {X = 3, Z = 8} would appear twice as a
solution for h(X,Z). The communication estimation algorithm presented so far
has to be modified if duplicates of this form are to be considered.

One more piece of information, a duplication bag, needs to be associated with
each compile-time message. The duplication bag associated with a compile-time
message  includes  the  duplication  factors  of  all  the  conjuncts  that  have  been  pro¬ 
cessed  so  far  in  the  conjunct  graph.  A  duplication  factor  for  any  particular  conjunct 
in  a  conjunctive  goal  is  a  number,  greater  than  or  equal  to  one,  and  is  a  probabilistic 
measure  of  how  many  actual  solutions  are  produced  for  each  unique,  actual  solution 
of  that  conjunct.  Therefore,  if  the  duplication  factor  is  3  and  we  expect  the  total 
number  of  solutions  generated  to  be  5,  then  we  expect  5/3  =  1.67  of  them  (proba¬ 
bilistically)  to  be  unique.  In  the  example  given  above,  the  literal  goal  that  is  solved 
by  the  given  rule  would  have  the  duplication  factor  2.0  associated  with  the  solution 
{X  =  3,  Z  =  8}.  Conjuncts  that  are  ancestors  along  more  than  one  path  will  have 
as  many  copies  of  their  duplication  factors  in  the  duplication  bag.  This  is  the  reason 
a  duplication  bag  is  a  bag  and  not  a  set.  Just  like  a  filter  set,  a  duplication  bag 
contains  2-tuples,  one  for  each  literal  in  the  antecedent  of  the  rule  that  generated 
the  associated  conjunct  graph.  Each  tuple  contains:  (1)  a  number  indicating  the 
position  (leftmost  being  1)  of  the  literal  in  the  antecedents  of  the  rule  and  (2)  the 
duplication  factor  for  the  literal  that  led  to  the  set  of  substitutions  associated  with 
the  compile-time  message. 

Just  as  equation  11  (reproduced  below  as  equation  22)  led  to  a  particularly 
simple  formulation  of  the  Communication  Estimation  algorithm  for  the  no  duplicate 
solutions  case,  there  is  a  similar  equation  for  the  duplicate  solutions  case. 
$$N = \prod_{v_i \in V} d(v_i) \times \prod_{i=1}^{n} FP(C_i) \tag{22}$$

The  number  of  solutions  N  for  a  set  of  conjuncts  (taking  duplicates  into  account) 
is  given  by  the  equation  below: 

$$N = \prod_{v_i \in V} d(v_i) \times \prod_{i=1}^{n} FP(C_i) \times \prod_{DF_i \in DFB} DF_i \tag{23}$$

where V is the set of variables in the conjunctive goal, d(X) gives the size of the
domain of variable X, FP(C_i) is the filter probability of conjunct C_i, and DFB
is the duplication factor bag of the set of conjuncts. The duplication factor bag is
the bag of duplication factors associated with the set of conjuncts. The number of
instances of each duplication factor in the duplication factor bag is the number of
distinct paths from the associated conjunct to the Tail process associated with the
conjunctive goal. For example, in figure 31, there are 2 instances of the duplication
factor associated with "a" and 1 instance each of the duplication factors associated
with "b" and "c" in the duplication factor bag for the conjunctive goal. If the
duplication factors of the 3 conjuncts a, b, and c are DF1, DF2, and DF3 respectively,
then the duplication factor bag for the conjunctive goal is ⟦DF1, DF1, DF2, DF3⟧.
In  this  case, 

$$\prod_{DF_i \in DFB} DF_i = DF_1^2 \times DF_2 \times DF_3$$

In particular, if DF1 = 2, DF2 = 1, and DF3 = 1, then

$$\prod_{DF_i \in DFB} DF_i = 2^2 \times 1 \times 1 = 4$$

In  other  words,  four  copies  should  be  expected  (on  the  average)  for  each  unique 
solution  of  the  conjunctive  goal. 
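
The arithmetic of equation 23 can be sketched in a few lines of Python. This is only an illustration: the function names, the use of a plain list as the bag, and the domain sizes and filter probabilities in the usage lines are assumptions, not values taken from the dissertation.

```python
from math import prod


def duplication_product(dup_factor_bag):
    """Product of all duplication factors in the bag (repeats count)."""
    return prod(dup_factor_bag)


def expected_solutions(domain_sizes, filter_probs, dup_factor_bag):
    """Equation 23: N = prod d(v_i) * prod FP(C_i) * prod DF_i."""
    return prod(domain_sizes) * prod(filter_probs) * prod(dup_factor_bag)


# Bag for the figure 31 example: "a" reaches the Tail process along two
# paths, so DF1 = 2 appears twice; DF2 = DF3 = 1 appear once each.
dfb = [2.0, 2.0, 1.0, 1.0]
copies_per_unique = duplication_product(dfb)

# Hypothetical inputs: two variables with domain size 2 and two
# conjuncts with filter probability 0.5 each.
n_total = expected_solutions([2, 2], [0.5, 0.5], dfb)
```

A list is used for the bag because it keeps repeated entries, which is exactly the property that distinguishes a bag from a set here.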

Again,  just  as  in  the  Communication  Estimation  Algorithm  (with  no  duplicates), 
the  algorithm  with  duplicates  follows  naturally  from  the  formulation  of  the  problem 
given  in  this  section.  The  detail  in  section  3.3.7  may  be  skipped  safely  on  the  first 
reading  of  the  dissertation  with  no  loss  of  continuity.  Interested  readers  may  return 
later  for  more  detail. 
Figure 31: A Conjunctive Goal

3.3.7  Communication Estimation Algorithm (with Duplicates)

The description of the algorithm is divided into four parts, just as it was for
the no-duplicate case. Only the differences from the no-duplicate case are
explained.

3.3.7.1  Analog of the CP function

There are n input channels and the messages on the channels are

$$\langle N_i, S_i, NI, FS_i, DB_i \rangle$$

If the messages contain inconsistent substitutions, then the output is ⊥,
as before. If the output is not ⊥, it is a message

$$\langle N_o, S_o, NI, FS_o, DB_o \rangle$$

Notice  that  the  initial  number  of  substitutions  is  the  same  for  the  input  messages 
as  the  output  message  because  all  of  them  belong  to  the  same  conjunct  graph. 

As before,

$$FS_o = \bigcup_{i=1}^{n} FS_i$$

However,

$$DB_o = \biguplus_{i=1}^{n} DB_i$$

where the symbol ⊎ denotes bag sum.

The expected number of solutions for each initial substitution of the conjunct
graph, N_int, is given by an application of equation 23:

$$N_{int} = \prod_{v_i \in V} d(v_i) \times \prod_{FP_i \in FPS} FP_i \times \prod_{DF_i \in DFB_o} DF_i$$

where DFB_o, the duplication factor bag, is the bag of duplication factors in the
duplication tuples belonging to DB_o. For example, if the duplication bag is
⟦<1, 1.5>, <1, 1.5>, <2, 3.5>⟧, where 1.5 is the duplication factor for the first
conjunct and 3.5 is the duplication factor for the second conjunct, then the
duplication factor bag is ⟦1.5, 1.5, 3.5⟧. Notice the slight variation from
equation 13 for the no-duplicate case.

The formula above gives the expected number of solutions for each initial
substitution. Therefore, the total expected number of solutions, N_o, is given by

$$N_o = NI \times \prod_{v_i \in V} d(v_i) \times \prod_{FP_i \in FPS} FP_i \times \prod_{DF_i \in DFB_o} DF_i$$
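
The merge step described above can be sketched as follows. This is a minimal illustration, not the dissertation's implementation: the function name, the representation of a filter set as a Python set of (position, FP) tuples, and of a duplication bag as a `Counter` of (position, DF) tuples are all assumptions. The consistency check that yields ⊥ is left out.

```python
from collections import Counter
from math import prod


def merge_messages(messages, ni, domain_sizes):
    """messages: list of (filter_set, dup_bag) pairs, one per input channel.

    Returns (FS_o, DB_o, N_o). Substitution handling is omitted.
    """
    fs_o = set().union(*(fs for fs, _ in messages))  # set union
    db_o = Counter()
    for _, db in messages:
        db_o += db                                   # bag sum keeps repeats
    n_int = (prod(domain_sizes)
             * prod(fp for _, fp in fs_o)
             * prod(df ** count for (_, df), count in db_o.items()))
    return fs_o, db_o, ni * n_int


# Hypothetical input: conjunct 1 is an ancestor along both channels.
msg1 = ({(1, 0.5)}, Counter({(1, 1.5): 1}))
msg2 = ({(1, 0.5), (2, 0.25)}, Counter({(1, 1.5): 1, (2, 3.5): 1}))
fs_o, db_o, n_o = merge_messages([msg1, msg2], ni=2, domain_sizes=[4, 4])
```

With these inputs the merged bag holds the tuple <1, 1.5> twice and <2, 3.5> once, which matches the bag used as the example in the text, while the filter tuple for conjunct 1 appears only once because filter sets are combined by set union.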

3.3.7.2  Response of Normal Process to Compile-Time Messages on Virtual
Input Channel

Let the compile-time message on the virtual input channel be

$$\langle N_i, S_i, NI_i, FS_i, DB_i \rangle$$

and the compile-time message on the subtask channel be

$$\langle N_o, S_o, NI_o, FS_o, DB_o \rangle$$

If a rule is used to reduce the goal,

$$N_o = N_i \times PU$$

where PU is the probability of unification of the goal with the head of the rule.

If a fact, with an associated number NF, is used to reduce the goal,

$$N_o = N_i \times PU \times NF$$

where PU is the probability of unification of the goal with the fact. Again, we have

$$NI_o = N_o$$

Also,

$$S_o = \{\}$$

$$FS_o = \{\}$$

$$DB_o = ⟦\,⟧$$

where ⟦ ⟧ stands for an empty bag.

3.3.7.3  Response of Tail Process to Compile-Time Messages on Virtual
Input Channel

Let the compile-time message on the input channel be

$$\langle N_i, S_i, NI_i, FS_i, DB_i \rangle$$

and the compile-time message on the task channel be

$$\langle N_o, S_o, NI_o, FS_o, DB_o \rangle$$

As before,

$$N_o = N_i$$

$$S_o = Composition(IS, S_i)$$

$$NI_o = NI_i$$

$$FS_o = FS_i$$

In addition,

$$DB_o = DB_i$$
3.3.7.4  Response of Normal Process to Compile-Time Message on Subsolution
Channel

Let the message on the virtual input channel to the normal process (that led to the
creation of the subsolution in question) be

$$\langle N_1, S_1, NI_1, FS_1, DB_1 \rangle$$

Also, let the compile-time message on the subsolution channel be

$$\langle N_2, S_2, NI_2, FS_2, DB_2 \rangle$$

and the compile-time message on the output channel from the normal process be

$$\langle N_3, S_3, NI_3, FS_3, DB_3 \rangle$$

As before,

$$N_3 = N_2$$

$$S_3 = Composition(S_1, S_2)$$

$$NI_3 = NI_1$$

The computation of the filter probability and duplication factor for the conjunct
associated with the normal process is somewhat involved and needs additional
notation. To make the notation easier to understand, a running example is used.

To begin with, let FP be the filter probability of the conjunct and DF be its
duplication factor. Let the literal associated with the normal process be G and the
rule used to reduce the goal be

G' :- SG'

where SG' is a set of conjuncts.

As our running example, consider the case where the rule is as given below.

h(a, X', Z', Q') :- t1(X', Y'), t2(Y', Z')

Therefore,

G' = h(a, X', Z', Q')

and

SG' = {t1(X', Y'), t2(Y', Z')}

Also, let G, the literal associated with the normal process, be as given below.

G = h(P, X, Z, W)

Assume  that  the  domains  of  all  the  variables  have  cardinality  2. 

Let lvars be the function that returns the set of variables in a literal. For
example, lvars(h(P, X, Z, W)) = {P, X, Z, W}. Also, let slvars be the function
that returns the set of variables in a set of literals. For example,

slvars({t1(X', Y'), t2(Y', Z')}) = {X', Y', Z'}

Notation related to a goal: The goal to be solved is $G|_{S_1}$. VS1, the set of
variables in the goal, is given by the equation below.

$$VS_1 = lvars(G|_{S_1}) \tag{24}$$

For the running example, let S1 = {W = b}. Therefore,

$$VS_1 = lvars(G|_{S_1}) = lvars(h(P, X, Z, b)) = \{P, X, Z\}$$

SD1, the cardinality of the domain of solutions of the goal, is given by the
equation below.

$$SD_1 = \prod_{v_i \in VS_1} d(v_i) \tag{25}$$

For the example,

$$SD_1 = d(P) \times d(X) \times d(Z) = 2 \times 2 \times 2 = 8$$


Notation related to an instance of a goal: An instance of the goal $G|_{S_1}$ that
unifies with the head of the rule is solved. Let mgu represent the function that
computes the most general unifier. Therefore, the most general unifier US of the
goal $G|_{S_1}$ with the head of the rule G' is given by the equation below.

$$US = mgu(G|_{S_1}, G') \tag{26}$$

For the example,

$$US = \{P = a, X' = X, Z' = Z, Q' = b\}$$

The instance of the goal that needs to be solved is therefore

$$G|_{S_1}|_{US}$$

For the example, this goal instance is

h(a, X, Z, b)

VS2, the set of variables in this instance of the goal, is given by the equation
below.

$$VS_2 = lvars(G|_{S_1}|_{US}) \tag{27}$$

For the example,

$$VS_2 = \{X, Z\}$$

g, the cardinality of the domain of solutions of the instance of the goal, is given
by the equation below.

$$g = \prod_{v_i \in VS_2} d(v_i) \tag{28}$$

For the example,

$$g = d(X) \times d(Z) = 2 \times 2 = 4$$

As before, the invocation substitution IS is the subset of the most general unifier
US that contains bindings of variables in the goal $G|_{S_1}$ only and not the
bindings of variables in G', the head of the rule.

For the example,

IS = {P = a}
Notation related to a conjunctive subgoal: As mentioned before, the rule in
question is

G' :- SG'

SG2, the conjunctive subgoal that needs to be solved, is given by the equation
below.

$$SG_2 = SG'|_{US} \tag{29}$$

where US is the most general unifier of the goal and the head of the rule (as
given in equation 26).

For the example,

SG2 = {t1(X, Y'), t2(Y', Z)}

VS4, the set of extra variables that are contained in SG2, the conjunctive
subgoal, and not in VS2, the set of variables in the instantiated goal (see
equation 27), is given by the equation below.

$$VS_4 = slvars(SG'|_{US}) - VS_2 = slvars(SG'|_{US}) - lvars(G|_{S_1}|_{US}) \tag{30}$$

For the example,

$$VS_4 = \{X, Y', Z\} - \{X, Z\} = \{Y'\}$$

h, the cardinality of the domain of these extra variables, is given by the equation
below.

$$h = \prod_{v_i \in VS_4} d(v_i) \tag{31}$$

For the example,

$$h = d(Y') = 2$$
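
The quantities SD1, g, and h for the running example can be checked with a short Python sketch. The variable sets are written out by hand from the text rather than computed by a real unifier, and the names `DOMAIN_SIZE` and `card` are illustrative assumptions.

```python
from math import prod

# All domains in the running example have cardinality 2.
DOMAIN_SIZE = {"P": 2, "X": 2, "Z": 2, "W": 2, "Y'": 2}


def card(variables):
    """Cardinality of the cross-product of the variables' domains."""
    return prod(DOMAIN_SIZE[v] for v in variables)


vs1 = {"P", "X", "Z"}        # lvars of the goal after S1 binds W to b
vs2 = vs1 - {"P"}            # the unifier US additionally binds P to a
sg2_vars = {"X", "Y'", "Z"}  # slvars of the conjunctive subgoal SG2
vs4 = sg2_vars - vs2         # extra variables of the subgoal (equation 30)

sd1, g, h = card(vs1), card(vs2), card(vs4)
```

The subgoal solution domain therefore has g × h = 8 members, of which every h = 2 map to the same member of the instantiated goal domain.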
Computation of filter probability and duplication factor: Let DF2 be the
compounded duplication factor of the conjuncts of the subgoal. Therefore,

$$DF_2 = \prod_{DF_i \in DFB_2} DF_i \tag{32}$$

where DFB2 is the duplication factor bag associated with the duplication bag
DB2 (i.e., the bag of duplication factors in the duplication bag DB2). Remember
that N2 is the total number of solutions being reported. N2/NI2 gives the number
of solutions for each initial substitution because NI2 is the initial number of
substitutions. Dividing N2/NI2 by DF2 gives m, the number of unique solutions of
the subgoal for each initial substitution. In other words,

$$m = \frac{N_2}{NI_2 \times DF_2}$$

Now, m unique solutions in the subgoal solution domain (cardinality = g × h) are
to be mapped into the instantiated goal domain (cardinality = g). For the example,
the solutions for the subgoal are distributed over the cross-product of the domains
of the variables in the set {X, Y', Z}. These are mapped into the cross-product
of the domains of variables in the set {X, Z}. The problem is to find how many
unique solutions will be obtained in the target domain. Each member of the
instantiated goal domain is the image of h members of the subgoal solution domain.
Since the distributions are random, the probability p' of a particular member of
the instantiated goal domain not being one of the solutions is given by

$$p' = \binom{g \times h - h}{m} \Big/ \binom{g \times h}{m} \tag{33}$$

Therefore, the probability p that a particular member is one of the solutions is
given by

$$p = 1 - p' = 1 - \binom{g \times h - h}{m} \Big/ \binom{g \times h}{m} \tag{34}$$


As pointed out by Treitel [69], the analysis given above is correct, strictly
speaking, only when m is an integer. Since the value of m is an expected value based
on a probabilistic analysis, it may not be an integer. Stirling's approximation for
binomial coefficients can be used to solve this problem.

There  is  another  problem  that  arises  because  the  analysis  above  assumes  that 
the  value  of  m  is  known  exactly  as  opposed  to  being  an  expected  value.  Since  p 
is  not  a  linear  function  of  m,  the  expected  value  of  p  cannot  be  obtained  simply 
by  using  the  expected  value  of  m  in  equation  34.  This  problem  is  ignored  in  this 
thesis. 
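
The probability p can be evaluated directly for integer m, and extended to non-integer m in more than one way. The sketch below is an illustration under stated assumptions: it takes p' to be the ratio of the number of m-subsets that avoid the h preimages to the number of all m-subsets, and for non-integer m it substitutes the log-gamma function for the Stirling approximation mentioned in the text. The function names are hypothetical.

```python
from math import comb, exp, lgamma


def p_exact(g, h, m):
    """p for integer m: 1 - C(g*h - h, m) / C(g*h, m)."""
    return 1 - comb(g * h - h, m) / comb(g * h, m)


def log_binom(n, k):
    """Logarithm of the binomial coefficient, extended to real k."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def p_real(g, h, m):
    """p for real-valued m, using log-gamma instead of Stirling."""
    n = g * h
    if m > n - h:          # too few members left to avoid all h preimages
        return 1.0
    return 1 - exp(log_binom(n - h, m) - log_binom(n, m))


# Running example: g = 4, h = 2; m = 3 is a hypothetical value.
p_int = p_exact(4, 2, 3)
p_frac = p_real(4, 2, 2.5)
```

For integer m the two functions agree up to floating-point error, and `p_real` interpolates smoothly between neighbouring integer values of m.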

Since the probability of a particular member of the instantiated goal domain
being a solution is p and the size of the domain is g, the expected number of unique
solutions in the domain is p × g. Moreover, since the total number of solutions is
m, the additional duplication factor due to this mapping (DFa) is given by

$$DF_a = \frac{m}{p \times g}$$

The duplication factor for this solution to the goal, DF, is given by multiplying
DF2, the duplication factor for the subgoal solution (as given in equation 32), and
this additional factor.

$$DF = DF_2 \times DF_a$$

The filter probability for this solution to the conjunct is obtained by dividing
the number of unique solutions obtained (= N2/DF) for each goal (= (N2/DF) ÷ N1)
by the cardinality of the domain of possible solutions for the goal (= SD1).
Therefore,

$$FP = \frac{N_2}{N_1 \times SD_1 \times DF}$$
Algorithm                          Complexity
---------------------------------  ---------------------------------
Communication Estimation           Up to exponential factor less
                                   than run-time computation
Communication Cost Computation     O(p^2)
Communication Cost Recomputation   O(p)

p = Number of partitions

Table 2: Complexity Results for Communication Cost Computation

Now, DF and FP can be worked into the output message in the 2-tuple format.
Let the position of the conjunct associated with the normal process be k. Therefore,

$$FS_3 = FS_2 \cup \{\langle k, FP \rangle\}$$

$$DB_3 = DB_2 \uplus ⟦\langle k, DF \rangle⟧$$
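
The chain from DF2 through m, p, DFa, DF, and FP to the updated filter set and duplication bag can be sketched as follows. This is a hedged illustration only: the function name, the `Counter` representation of bags, the rounding of m to the nearest integer, and all numeric inputs are assumptions made for the example.

```python
from collections import Counter
from math import comb


def subsolution_response(n1, sd1, n2, ni2, fs2, db2, g, h, k):
    """Return (DF, FP, FS_3, DB_3) for the conjunct at position k."""
    df2 = 1.0
    for (_, df_i), count in db2.items():
        df2 *= df_i ** count                   # equation 32
    m = round(n2 / (ni2 * df2))                # assumption: m is rounded
    if not 1 <= m <= g * h:
        raise ValueError("sketch assumes 1 <= m <= g * h")
    p = 1 - comb(g * h - h, m) / comb(g * h, m)
    df_a = m / (p * g)                         # extra duplication from mapping
    df = df2 * df_a
    fp = n2 / (n1 * sd1 * df)
    fs3 = fs2 | {(k, fp)}                      # set union
    db3 = db2 + Counter({(k, df): 1})          # bag sum
    return df, fp, fs3, db3


# Hypothetical message values for the running example (g = 4, h = 2, SD1 = 8).
df, fp, fs3, db3 = subsolution_response(
    n1=1, sd1=8, n2=6, ni2=1, fs2=set(), db2=Counter({(1, 2.0): 1}),
    g=4, h=2, k=2)
```

With these inputs m = 3, so the expected number of unique solutions in the goal domain is p × g = 18/7, and FP equals that number divided by SD1.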


3.3.8  Complexity 

Complexity results are summarized in table 2. More explanation, including the basis
for the results, is given in the following sections (3.3.8.1 and 3.3.8.2).

3.3.8.1  Communication Estimation Algorithm

Using unknown constants in the abstract backward-chaining deduction ensures that
the number of logical inferences in the abstract deduction is either equal to or less
than the number of inferences when no unknown constants are used. In the worst
case, no reduction takes place in the number of logical inferences. In the best case,
the number of logical inferences can be reduced by an exponential factor (as was
seen earlier in section 3.2).


3.3.8.2  Communication Cost Computation Algorithm

If there are p partitions, the complexity of this computation is O(p^2) because all
pairs of partitions may communicate with each other in the worst case. If there are
multiple copies, additional communication needs to be accounted for as described
in section 3.1.4. However, this only takes a constant number of operations for each
pair of partitions and therefore the complexity remains O(p^2).

If  the  communication  cost  needs  to  be  recomputed  after  a  single  partition  is 
reallocated  to  another  processor,  the  cost  of  the  recomputation  is  O(p)  because  the 
partition  in  question  may  communicate  with  all  other  partitions  in  the  worst  case. 
Again,  the  presence  of  multiple  copies  makes  no  difference  to  the  complexity. 
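
The two complexity bounds can be illustrated with a small Python sketch. The cost definition used here is an assumption made for illustration only (the sum of estimated traffic between partitions placed on different processors); the actual definition, including the treatment of multiple copies, is the one given in section 3.1.4. The names `traffic` and `alloc` are hypothetical.

```python
def communication_cost(traffic, alloc):
    """O(p^2): visit every pair of partitions once."""
    p = len(alloc)
    return sum(traffic[i][j]
               for i in range(p) for j in range(i + 1, p)
               if alloc[i] != alloc[j])


def recompute_after_move(cost, traffic, alloc, part, new_proc):
    """O(p): only pairs that involve the moved partition can change."""
    old_proc = alloc[part]
    for other in range(len(alloc)):
        if other == part:
            continue
        t = traffic[part][other]
        if alloc[other] != old_proc:
            cost -= t      # this pair was remote before the move
        if alloc[other] != new_proc:
            cost += t      # this pair is remote after the move
    alloc[part] = new_proc
    return cost


# Hypothetical symmetric traffic estimates for three partitions.
traffic = [[0, 3, 1], [3, 0, 2], [1, 2, 0]]
alloc = [0, 0, 1]          # processor assigned to each partition
cost = communication_cost(traffic, alloc)
cost_after = recompute_after_move(cost, traffic, alloc, part=1, new_proc=1)
```

The incremental update touches p - 1 pairs, which is why reallocating a single partition costs O(p) rather than O(p^2).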


3.4  Processor  Multiplexing  Cost  Computation 

The  computation  of  processor  multiplexing  cost  (defined  in  section  3.1.5)  is  done  by 
two  algorithms.  The  first  algorithm  is  called  the  Processing  Interval  Assignment 
algorithm.  This  algorithm  performs  an  abstract  simulation  of  PM,  similar  to  the 
one  for  estimating  communication.  A  side-effect  of  the  simulation  is  the  assignment 
of  processing  intervals  for  the  operations  that  need  to  be  performed.  A  processing 
interval  is  a  3-tuple  of  a  start  time,  a  finish  time  and  a  processor  load.  The  second 
algorithm  is  called  the  Processor  Multiplexing  Cost  Computation  algorithm.  This 
algorithm  takes  the  output  of  the  Processing  Interval  Assignment  algorithm  and  an 
allocation  and  computes  the  processor  multiplexing  cost. 
3.4.1  Cost Model

To  estimate  any  cost,  one  needs  a  cost  model.  A  cost  model  specifies  the  cost 
incurred  for  some  set  of  basic  operations.  A  useful  cost  model  is  one  that  picks 
these  basic  operations  such  that  all  operations  that  have  any  associated  cost  must 
be  decomposable  into  these  basic  operations.  This  subsection  presents  a  useful  cost 
model  for  PM. 

The  basic  operations  chosen  with  their  associated  cost  are: 

•  Selecting  the  next  task  to  work  on:  We  assume  that  any  task  that  is  ready  to 
be  executed  may  be  picked.  Cost  assigned  is  0. 

•  Selecting rules/assertions to unify with goal: This is essentially a database
indexing operation. Cost is assumed to be a constant K_I.

•  Plugging a substitution into a literal: Cost is K_P.

•  Doing a successful unification: Cost is K_U.

•  Doing an unsuccessful unification: Cost is K_FU.

•  Doing  a  successful  application  of  the  Merge  function:  Cost  assigned  is  0. 

•  Doing  an  unsuccessful  application  of  the  Merge  function:  Cost  assigned  is  0. 

Note  that  the  constants  used  above  are  dependent  on  the  multiprocessor  used. 
These  constants  have  units  of  time  such  as  seconds,  for  example. 

3.4.2  Processing  Interval  Assignment  Algorithm 

This algorithm is split into four parts, just as the Communication Estimation
algorithm was. In that algorithm, the response of processes to messages on different
channels was described in terms of their communication requirements. We now do
the same in the present algorithm in terms of processing requirements.

A compile-time message is augmented to include two other pieces of information:
(1) a start time, ST, and (2) a finish time, FT. In its entirety, a compile-time
message looks like:

$$\langle N, S, NI, FS, DB, ST, FT \rangle$$

Fields other than ST and FT have been defined before in section 3.3. It is assumed
that the N actual messages that this represents are distributed uniformly in time
from ST to FT. For the top-level goal, ST and FT are both 0. This is interpreted
to mean that the top-level goal is given at time 0.

The basis for the probabilistic analysis here is the same as that for the
Communication Estimation algorithm. All the detail that follows now for the four
parts of the Processing Interval Assignment Algorithm can be safely skipped on the
first reading without loss of continuity. Interested readers can return later for
the additional detail.

3.4.2.1  Response of Normal Process to Compile-time Messages on Virtual
Input Channel

Let the compile-time message on the virtual input channel be

$$\langle N_{in}, S_{in}, NI_{in}, FS_{in}, DB_{in}, ST_{in}, FT_{in} \rangle$$

There are two cases that need to be considered. In the first case, there are n
rules that may be used to reduce a goal. In the second case, NF facts may be used
to solve the goal. These cases are treated separately. If there are both rules and
facts to reduce/solve the goal, then it is easy to see how a combination of the two
procedures may be used.

Case I: Rules only  If there are n rules that may be applied to the goal, then
n subtask channels will be set up and one message will be sent on each. We will
assume that unifications with the rules are done in order from 1 to n. Assume also
that the probability of unification of the goal with the k'th rule is PU_k. Let the
compile-time message on the k'th subtask channel be

$$\langle N_{out_k}, S_{out_k}, NI_{out_k}, FS_{out_k}, DB_{out_k}, ST_{out_k}, FT_{out_k} \rangle$$




Remember that the amounts of time taken for plugging a substitution into a
literal, for indexing the rules/assertions to unify with a goal, for a successful
unification, and for an unsuccessful unification are K_P, K_I, K_U, and K_FU
respectively. For each actual message to a normal process, the substitution in the
message is applied to the literal associated with the process, all relevant
rules/assertions are indexed, and unifications are attempted between the goal and
the rules/assertions. Assume that there are n rules and unifications are attempted
in order starting with the rule numbered 1 and ending with the rule numbered n.
Therefore, Δ_k, the time taken from the input of an actual message at a process to
the possible output of a message on the k'th subtask channel (corresponding to the
k'th rule), is given by:

$$\Delta_k = K_P + K_I + \sum_{i=1}^{k} [PU_i \times K_U + (1 - PU_i) \times K_{FU}] \tag{35}$$

Therefore, ST_out_k and FT_out_k are given by:

$$ST_{out_k} = ST_{in} + \Delta_k \tag{36}$$

$$FT_{out_k} = FT_{in} + \Delta_k \tag{37}$$

We will now characterize the processing interval, <ST_I, FT_I, PL_I>, associated
with the processor for this computation. The start time, ST_I, of the processing
interval is given by:

$$ST_I = ST_{in} \tag{38}$$

The finish time, FT_I, of the processing interval is given by:

$$FT_I = FT_{out_n} \tag{39}$$

Let PT be the total amount of processing in time units for this computation.

$$PT = N_{in} \times \Delta_n \tag{40}$$

Therefore, the processor load PL_I, or the average number of virtual processors
busy in the processing interval, is given by:

$$PL_I = \frac{PT}{FT_I - ST_I} \tag{41}$$

Plugging in the value of PT from equation 40 into equation 41, we get:

$$PL_I = \frac{N_{in} \times \Delta_n}{FT_I - ST_I} \tag{42}$$

Of  course,  all  other  fields  of  the  output  compile-time  messages  can  be  computed 
using  the  Communication  Estimation  algorithm. 
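
The rules-only case can be sketched in Python as follows. This is a minimal illustration, assuming the cost constants of section 3.4.1 are bundled into a small record and that the per-rule delay is the plug-in cost, plus the indexing cost, plus the expected cost of the first k unification attempts. The class and function names and all numeric values are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class CostModel:
    k_p: float    # plugging a substitution into a literal
    k_i: float    # indexing rules/assertions
    k_u: float    # successful unification
    k_fu: float   # unsuccessful unification


def delta(cm, pu, k):
    """Expected delay up to the k'th rule (k is 1-based)."""
    return cm.k_p + cm.k_i + sum(
        q * cm.k_u + (1 - q) * cm.k_fu for q in pu[:k])


def rules_only(cm, pu, n_in, st_in, ft_in):
    """Return per-channel (ST_out, FT_out) pairs and the processing interval."""
    n = len(pu)
    out_times = [(st_in + delta(cm, pu, k), ft_in + delta(cm, pu, k))
                 for k in range(1, n + 1)]
    st_i, ft_i = st_in, out_times[-1][1]
    pl_i = n_in * delta(cm, pu, n) / (ft_i - st_i)   # processor load
    return out_times, (st_i, ft_i, pl_i)


# Hypothetical constants (arbitrary time units) and unification probabilities.
cm = CostModel(k_p=1.0, k_i=2.0, k_u=4.0, k_fu=1.0)
out_times, interval = rules_only(cm, pu=[0.5, 0.25], n_in=4,
                                 st_in=0.0, ft_in=10.0)
```

Here four messages each need 7.25 time units of processing, spread over an interval of length 17.25, so the load is about 1.68 virtual processors.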

Case II: Facts only  In this case, NF facts are available for attempting to solve
the goal. The compile-time message on the virtual input channel is

$$\langle N_{in}, S_{in}, NI_{in}, FS_{in}, DB_{in}, ST_{in}, FT_{in} \rangle$$

as before. In this case, only one subtask channel is set up since the whole set of
NF facts is considered in one pass (because they are included in one fact pattern).
Let the compile-time message on the subtask channel be

$$\langle N_{out}, S_{out}, NI_{out}, FS_{out}, DB_{out}, ST_{out}, FT_{out} \rangle$$

As before, Δ_k, the time taken from the input of an actual message at a process
to the (possible) output of a message on the subtask channel, is given by:

$$\Delta_k = K_P + K_I + \sum_{i=1}^{k} [PU \times K_U + (1 - PU) \times K_{FU}] \tag{43}$$

Since only one subtask channel is set up, ST_out in this case is the minimum
of the ST_out_k's in the previous case (with rules only) and FT_out in this case is
the maximum of the FT_out_k's in the previous case. Therefore,

$$ST_{out} = ST_{in} + \Delta_1 \tag{44}$$

$$FT_{out} = FT_{in} + \Delta_{NF} \tag{45}$$

We will now characterize the processing interval, <ST_I, FT_I, PL_I>, associated
with the processor for this computation. The start time, ST_I, of the processing
interval is given by:

STi  =  ST^ 


(46) 
The finish time, FT_I, of the processing interval is given by:

$$FT_I = FT_{out} \tag{47}$$

Therefore, the processor load PL_I, or the average number of virtual processors
busy in the processing interval, is given by:

$$PL_I = \frac{N_{in} \times \Delta_{NF}}{FT_I - ST_I} \tag{48}$$

3.4.2.2  Response of Tail Process to Compile-time Messages on Virtual
Input Channel

Since no basic operations are included in this computation, no cost is incurred. A
message on the virtual input channel produces a message on the solution channel
immediately with no time delay.

3.4.2.3  Response of Normal Process to Compile-time Messages on Subsolution
Channels

Again, no basic operations are included and, therefore, the computation is free.

3.4.2.4  Analog of the CP function

Sim-Merge,  the  function  described  in  the  Communication  Estimation  algorithm, 
must  be  augmented  further.  The  additional  computation  to  be  performed  by  Sim- 
Merge  is  described  below. 

Let there be n input channels and the messages on the channels be

$$\langle N_i, S_i, NI, FS_i, DB_i, ST_i, FT_i \rangle$$

If the messages contain inconsistent substitutions, then the output is ⊥,
as before. If the output is not ⊥, it is a message

$$\langle N_o, S_o, NI, FS_o, DB_o, ST_o, FT_o \rangle$$

In this computation, no basic operations are included. However, one still has to
assign a start time, ST_o, and a finish time, FT_o, to the output message. ST_o is
the earliest possible time that an actual message associated with the output
compile-time message is sent out. Notice that there must be at least one actual
message on each of the input channels to produce an actual message on the virtual
input channel. Therefore,

$$ST_o = \max_{i=1}^{n} ST_i \tag{49}$$

Similarly, FT_o is the latest possible time that an actual message associated with
the output compile-time message is sent out. Therefore,

$$FT_o = \max_{i=1}^{n} FT_i \tag{50}$$

A uniformity assumption has been made here that all actual messages associated
with the output compile-time message are uniformly distributed over this interval
from ST_o to FT_o.
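
The timing part of the merge is small enough to state in a few lines of Python. The function name and the tuple representation of the input intervals are illustrative assumptions.

```python
def merge_times(intervals):
    """intervals: one (ST_i, FT_i) pair per input channel.

    The output start time is the latest input start time, because a
    merged message needs at least one message from every channel.
    """
    starts, finishes = zip(*intervals)
    return max(starts), max(finishes)


# Hypothetical input intervals on three channels.
st_o, ft_o = merge_times([(0.0, 5.0), (2.0, 4.0), (1.0, 9.0)])
```

Note that the start time uses a maximum rather than a minimum: the channel that starts last determines when the first merged message can appear.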


3.4.3  Processor  Multiplexing  Cost  Computation  Algorithm 

The  algorithm  will  be  referred  to  by  its  abbreviated  name  PMCCA.  The  input  is 
a  set  of  sets  of  processing  intervals — one  set  for  each  processor  that  will  be  used  at 
run-time.  The  output  is  a  number  that  represents  the  processor  multiplexing  cost 
for  the  multiprocessor. 

For  this  algorithm,  processing  intervals  are  represented  in  a  different  manner 
than  before.  Each  processing  interval  is  represented  as  two  elements,  one  for  each 
end-point  of  the  interval.  Some  additional  information  is  also  included  in  each 
element.  In  all,  an  element  is  a  5-tuple  with  the  following  fields: 

1.  Type: This is either start or finish depending on whether this element
represents the start end-point or finish end-point for the interval in question.

2.  Time: This is the time associated with the start or finish end-point of the
interval in question.

3.  Load: This is the processor load associated with the processing interval.

4.  CLoad: This is the cumulative load of all intervals that overlap at this instant
in time.

This algorithm uses an auxiliary procedure PMCCA-1 that takes the set of
processing intervals associated with a single processor and returns the processor
multiplexing cost for that processor only.³ After this auxiliary procedure is run on
each processor, the sum of all the individual processor multiplexing costs gives the
processor multiplexing cost for the multiprocessor.

The  procedure  PMCCA-1  uses  an  abstract  data  structure  that  we  will  call  a  PQ- 
list.  The  name  suggests  the  similarity  of  the  data  structure  to  both  priority-queues 
[3]  and  sorted  lists.  The  elements  are  maintained  in  a  2-3  tree  [3],  for  example, 
to  get  log  performance  for  insertions  and  deletions.  They  are  also  maintained  in  a 
sorted  list  (in  increasing  order).  This  is  easy  because  2-3  trees  have  leaves  in  sorted 
order  anyway  from  left  to  right.  In  all,  the  data  structure  supports  the  following 
abstract  operations: 

1.  InsertPQL(PQL,  element,  key):  This  inserts  the  element  element  into  the 
PQ-list  PQL  in  log  time.  In  addition,  the  CLoad  field  of  the  element  is  set 
to  the  CLoad  field  of  the  previous  element  (in  sorted  order).  If  there  is  no 
previous  element,  then  the  field  is  set  to  zero. 

2.  DeletePQL(PQL,  element):  This  deletes  element  from  PQL  in  log  time. 

3.  EnumeratePQL(PQL, element1, element2): Enumerates all elements in PQL
in sorted order from the element element1 to the element element2. This is
done in time linear in the number of elements enumerated.
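
The three abstract operations can be sketched in Python as shown below. This is a simplified stand-in, not the structure described above: it keeps the elements in a sorted Python list located with `bisect`, so insertion and deletion take linear rather than logarithmic time, whereas the text calls for a balanced tree such as a 2-3 tree. The class, method, and field names are hypothetical.

```python
import bisect
from dataclasses import dataclass
from itertools import count


@dataclass
class Element:
    kind: str            # "start" or "finish"
    time: float
    load: float
    cload: float = 0.0   # cumulative load at this instant


class PQList:
    def __init__(self):
        self._keys = []      # sorted (key, sequence number) pairs
        self._items = []     # elements, parallel to _keys
        self._seq = count()  # tie-breaker for equal keys

    def insert_pql(self, element, key):
        k = (key, next(self._seq))
        pos = bisect.bisect_right(self._keys, k)
        # CLoad is copied from the previous element, or zero if none.
        element.cload = self._items[pos - 1].cload if pos > 0 else 0.0
        self._keys.insert(pos, k)
        self._items.insert(pos, element)

    def _index(self, element):
        return next(i for i, e in enumerate(self._items) if e is element)

    def delete_pql(self, element):
        pos = self._index(element)
        del self._keys[pos]
        del self._items[pos]

    def enumerate_pql(self, first, last):
        return self._items[self._index(first):self._index(last) + 1]


pq = PQList()
a = Element("start", 1.0, 2.0)
b = Element("finish", 5.0, 2.0)
pq.insert_pql(a, a.time)
pq.insert_pql(b, b.time)
a.cload = 2.0                    # as an interval-insertion step might set it
c = Element("start", 3.0, 1.0)
pq.insert_pql(c, c.time)         # c inherits CLoad from its predecessor a
```

Swapping the two parallel lists for a balanced search tree with threaded leaves would restore the logarithmic bounds without changing this interface.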

A detailed description of the procedure PMCCA-1 is given in appendix B. However,
a rough description will be given here. Each processor has an associated PQ-list
and a variable PMC. PMC is the current value of the processor multiplexing cost
for the current set of processing intervals in the PQ-list. PMC is zero initially when
there are no elements in the PQ-list. Each processing interval is inserted into the
PQ-list as two elements using a procedure called InsertPI (for Insert Processing
Interval). When elements are inserted into the PQ-list, the data structure itself must
be modified as necessary (by using the abstract operation InsertPQL for PQ-lists).
In addition, the CLoad fields of all elements whose Time fields fall within the time
interval of the processing interval may have to be modified. After all processing
intervals have been inserted into the PQ-list, the value of PMC is the processor
multiplexing cost for the processor.

³ Notice that processor multiplexing cost is defined for a multiprocessor, but a
single processor is just a special case of a multiprocessor.

Another  procedure  called  DeletePI  is  used  to  remove  processing  intervals  from 
the  PQ-list.  DeletePI  is  not  used  if  the  processor  multiplexing  cost  is  to  be  computed 
once  only  for  a  particular  allocation.  However,  it  is  useful  when  more  than  one 
allocation  needs  to  be  considered.  The  next  chapter  will  demonstrate  this  need. 

Both InsertPI and DeletePI perform at most a constant number of operations for
each element in the PQ-list that lies between the two end-points (in sorted order).
This can be verified by looking at the detailed description in appendix B. In the
worst case, this set of elements could include every element in the PQ-list. In
addition, the associated InsertPQL and DeletePQL operations take logarithmic time
(in the number of elements in the PQ-list). Therefore, the total time complexity of
both InsertPI and DeletePI is O(n), where n is the number of elements in the PQ-list.
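The linear sweep between the two end-points can be sketched in Python. This is a simplified stand-in and not the PMCCA-1 procedure of appendix B: the class name, the method names, and the use of plain sorted lists in place of the PQ-list are assumptions made for illustration, and the sketch only maintains the cumulative load at each event time, leaving the PMC formula itself aside.

```python
import bisect


class LoadTimeline:
    """Sorted event times, each with the cumulative load that holds from it on.

    Hypothetical stand-in for a PQ-list: times[i] plays the role of a Time
    field and cload[i] the role of a CLoad field.
    """

    def __init__(self):
        self.times = []   # sorted event times
        self.cload = []   # cload[i] is the load on [times[i], times[i + 1])
        self.refs = []    # how many interval end-points sit at times[i]

    def _add_point(self, t):
        i = bisect.bisect_left(self.times, t)
        if i < len(self.times) and self.times[i] == t:
            self.refs[i] += 1
            return i
        inherited = self.cload[i - 1] if i > 0 else 0.0
        self.times.insert(i, t)
        self.cload.insert(i, inherited)
        self.refs.insert(i, 1)
        return i

    def _drop_point(self, t):
        i = bisect.bisect_left(self.times, t)
        self.refs[i] -= 1
        if self.refs[i] == 0:
            del self.times[i], self.cload[i], self.refs[i]

    def insert_interval(self, start, end, load):
        """Insert [start, end) as two elements; assumes start < end."""
        lo = self._add_point(start)
        hi = self._add_point(end)
        for i in range(lo, hi):          # every element between the end-points
            self.cload[i] += load

    def delete_interval(self, start, end, load):
        """Undo an earlier insert_interval with the same arguments."""
        lo = bisect.bisect_left(self.times, start)
        hi = bisect.bisect_left(self.times, end)
        for i in range(lo, hi):
            self.cload[i] -= load
        self._drop_point(end)
        self._drop_point(start)

    def peak_load(self):
        return max(self.cload, default=0.0)
```

Locating an end-point by bisection is logarithmic, but the loop over the elements between the end-points is linear, which is why the overall bound is O(n) regardless of how the underlying list is stored.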


3.4.4  Complexity 

The  complexity  results  for  the  processor  interval  assignment  algorithm  and  the 
processor  multiplexing  cost  computation  algorithm  are  summarized  in  table  3.  More 
explanation  including  the  basis  for  the  results  is  given  in  the  following  sections 
(3.4.4.1  and  3.4.4.2). 

3.4.4.1 Processor Interval Assignment Algorithm

Just as in the case of the communication estimation algorithm, an abstract
backward-chaining deduction is done using unknown constants as the abstraction.
Also, the number of additional operations for each logical inference is constant.
Therefore, the complexity of this algorithm is the same as that for the communication
estimation algorithm.

Algorithm                                    Complexity
-------------------------------------------  ----------------------------------
Processor Interval Assignment                Up to exponential factor less than
                                             run-time computation
Processor Multiplexing Cost Computation      O(qr^2)
Processor Multiplexing Cost Recomputation    O(qr^2)

q = Number of processors
r = Number of subgoals at compile-time

Table 3: Complexity Results for Processor Multiplexing Cost Computation

In fact, both the communication estimation algorithm and the processor interval
assignment algorithm can be performed concurrently using just one abstract
backward-chaining deduction, and this is how they are implemented. Although
this cost saving is important for an implementation, the complexity results
remain the same whether the two algorithms are performed concurrently or
separately.


3.4.4.2 Processor Multiplexing Cost Computation Algorithm

Assume at first that no multiple copies are allowed for partitions. Let r be the
number of subgoals generated during the abstract backward-chaining deduction of
the Processor Interval Assignment Algorithm. Each subgoal will have an associated
processing interval. In the worst case, all subgoals may be allocated to the same
processor and, therefore, the same PQ-list. As mentioned before, InsertPI and DeletePI
take O(n) time, where n is the number of elements in the PQ-list. Therefore, the
time taken for the combined set of InsertPIs to compute the processor multiplexing
cost initially is O(r^2).

Now, consider the case with multiple copies. Let q be the number of processors
in the system. The maximum number of copies possible for any partition is then
q. If the number of copies of a certain partition is m, then each of its processing
intervals in the single copy case is now modelled as m processing intervals, each
with 1/m of the original processor load. The time taken for this algorithm is greatest
when all r processing intervals have q copies associated, one in each processor. The
combined set of InsertPIs will take O(qr^2) time. Note that it is not O(q^2 r^2) because
any single processor can only contain one copy of a partition.
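The two bounds can be made explicit with a short worked sum. This is a sketch under the assumption that an InsertPI into a PQ-list of n elements costs at most cn for some constant c (a symbol introduced here for illustration), and that each processing interval contributes two elements to its PQ-list.

```latex
% Single copy: the i-th InsertPI meets at most 2(i-1) elements in one PQ-list.
\[
T_{\text{single}} \le \sum_{i=1}^{r} c \cdot 2(i-1) = c\,r(r-1) = O(r^2)
\]
% Multiple copies: q PQ-lists, each receiving at most r processing intervals.
\[
T_{\text{multi}} \le q \sum_{i=1}^{r} c \cdot 2(i-1) = c\,q\,r(r-1) = O(qr^2)
\]
% If all qr intervals could land in a single PQ-list, the bound would instead be
\[
\sum_{i=1}^{qr} c \cdot 2(i-1) = c\,qr(qr-1) = O(q^2 r^2),
\]
% which cannot happen because a processor holds at most one copy of a partition.
```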

Now, let us say that reallocations are allowed and a single partition may be
reallocated to another processor. Processor multiplexing cost may be computed
by using DeletePIs on the associated processing intervals to remove them from
the original processor’s PQ-list and then applying InsertPIs to insert the same
processing intervals into the new processor’s PQ-list. Since the partition in question
may include all the subgoals in the worst case, the cost of recomputation is O(qr^2),
the same as the worst case cost for the original computation. In the typical case,
however, one would hope to do a lot better than this.


3.5  Summary 

This  chapter  has  presented  the  formal  definition  for  the  cost  function  that  is  the 
basis  for  allocation.  The  cost  function  relates  well  to  intuitive  notions  of  the  quality 
of  allocations.  One  way  to  view  the  cost  function  is  that  it  treats  all  communication 
delays  and  delays  due  to  sequentialization  of  parallel  tasks  as  being  on  the  critical 
path  of  the  computation  in  the  worst  case.  Since  the  parallel  time  for  execution 
is  the  same  for  all  allocations,  it  is  the  extra  delay  due  to  communication  and 
sequentialization  that  should  be  used  (and  is  used)  as  the  cost  function  to  compare 
different  allocations. 

An  important  feature  of  the  cost  function  is  that  it  is  efficient  to  compute  and 
recompute.  Algorithms  were  presented  to  do  this  computation  and  recomputation. 

The cost function ignores two aspects of allocations that should be included in
a future allocator, if possible. First, as mentioned above, all delays and
sequentializations of parallel tasks are considered to be on the critical path. It would be
better to work without this assumption. Second, the communication delay function
does not take congestion of communication channels into account. Despite these
two simplifications, the cost function serves as a good basis for an allocator, as the
next chapter will show.

Chapter 4

Allocation  Algorithms 


This  chapter  describes  the  algorithms  used  by  the  allocator  to  perform  a  limited 
search  of  the  space  of  allocations.  In  addition,  the  chapter  includes  experimental 
results  obtained  from  an  implementation  of  the  allocator  and  PM. 

There are two main algorithms for searching the space of allocations. Both
use  the  cost  function  and  associated  algorithms  described  in  the  previous  chapter. 
The  first  algorithm  is  a  greedy  algorithm  in  which  partitions  are  allocated  one  at 
a  time.  A  partition  is  allocated  to  the  lowest  cost  processor  without  re-allocating 
any  partitions  that  were  allocated  previously.  The  second  algorithm  is  a  local 
minimization  algorithm.  This  algorithm  consists  of  a  sequence  of  cost-reducing 
re-allocations  of  partitions  to  neighboring  processors. 

Both  allocation  algorithms  are  described  in  detail  next  followed  by  experimental 
results.  Some  related  work  is  also  discussed  at  the  end  of  the  chapter. 


4.1  Greedy  Allocation 

This section contains the specifications of the algorithm, a description of the
algorithm, a discussion of its complexity, and an example to show that it does not
necessarily produce a locally optimal solution. However, the section on experimental
results will show later that, in a typical case, greedy allocation can produce good
allocations by itself.

4.1.1  Specifications 

Inputs 

1.  P:  a  set  of  partitions  of  the  database. 

2. C: a function that takes two partitions P1 and P2 and returns a tuple of
the form <data, number> where data is the amount of data (in bytes) and
number is the number of messages sent from partition P1 to partition P2. data
and number are expected values in a probabilistic sense.

3.  PI:  a  function  that  takes  a  partition  and  returns  the  set  of  processing  intervals 
associated  with  the  partition. 

4. Multiprocessor constants: These are K1, K2, and K3 used to compute
communication cost as given by equations 5 and 6.

5.  Topology:  This  includes  (1)  distances  between  all  pairs  of  processors  and  (2) 
lists  of  neighbors  of  each  processor. 

Outputs 

1.  Allocation:  A  many-to-one  mapping  from  the  set  of  partitions  to  the  set  of 
processors. 

2.  Number  of  copies  for  each  partition:  If  the  number  of  copies  is  greater  than 
1,  then  the  allocation  above  specifies  the  central  processor  for  the  cluster  of 
copies.  The  number  of  copies  will  determine  the  processors  around  the  central 
processor  that  will  also  contain  copies  of  the  partition. 

4.1.2  Algorithm 

Let us assume for now that each partition has a single copy. The extensions to
handle multiple copies will be described later in this section.

The overall structure of the greedy allocation algorithm is as follows: Starting
from an empty allocation, each partition is allocated one at a time. The single
partition under consideration at any time is allocated to the processor that leads
to the lowest cost. After a partition is allocated, it is not reallocated to another
processor.

This algorithm is embodied in the procedure GreedyAllocation shown in figure
32. As shown in the figure, all inputs to the procedure are implicit. These inputs
were described in the specifications of the algorithm given above. At the beginning
of each iteration of the outer For loop, there is a partial allocation of some partitions
to processors. Each iteration allocates the next partition to the processor that leads
to the lowest cost. The inner For loop considers allocation of the partition to each
processor in turn. The code segment “Allocate Partition to Processor” includes (1)
the application of the procedure InsertPI from chapter 3 to each processing interval
associated with the partition Partition, with the second argument of the call to
InsertPI being the PQ-list associated with the processor Processor, (2) the update of
the cost function due to the additional communication between the partition and
those already allocated, and (3) the update of the state of allocation reflecting that
the partition has been allocated to the processor. The code segment “Deallocate
Partition from Processor” includes the opposite operations.

Procedure GreedyAllocation()
begin
    PotentialProcs ← AllProcs;
    /* AllProcs is the set of all processors */
    For all Partition ∈ SetOfPartitions do begin
        /* SetOfPartitions is the set of all partitions */
        BestCost ← ∞;
        BestProc ← nil;
        For all Processor ∈ PotentialProcs do begin
            Allocate Partition to Processor;
            TempCost ← Cost(Allocation);
            If TempCost < BestCost then begin
                BestCost ← TempCost;
                BestProc ← Processor
            end; /* If */
            Deallocate Partition from Processor
        end; /* For */
        Allocate Partition to BestProc
    end /* For */
end; /* GreedyAllocation */

Figure 32: Procedure GreedyAllocation

Multiple copies can be handled in a couple of ways. One method is more
principled as well as more costly than the other one. I will describe this first. The only
change required from the procedure GreedyAllocation is that the inner For loop
needs to be changed as follows.

“For all Processor ∈ PotentialProcs do begin”
needs to be changed to
“For all combinations of Processor ∈ PotentialProcs and
NumCopies = 1 ... Cardinality(AllProcs) do begin”
and
“Allocate Partition to Processor”
needs to be changed to
“Allocate NumCopies copies of Partition to Processor.”

This last statement is interpreted to mean that the central processor of the
cluster of the NumCopies copies should be Processor. Call this modified version of
the procedure GreedyAllocation'.
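The single-copy procedure of figure 32 can be sketched in Python as follows. This is a minimal illustration, not the thesis implementation: the function names, the dictionary representation of an allocation, and the toy cost function (message count times hop distance, plus a hypothetical crowding penalty standing in for processor multiplexing cost) are all assumptions. The sketch also recomputes the cost from scratch for each trial, whereas the procedure described above updates it incrementally through InsertPI and DeletePI.

```python
import math


def greedy_allocation(partitions, processors, cost):
    """Allocate partitions one at a time, each to its lowest-cost processor.

    `partitions` is taken in the given order, `processors` plays the role of
    PotentialProcs, and `cost` maps a partial allocation (a dict from
    partition to processor) to a number. All three are illustrative stand-ins.
    """
    allocation = {}
    for partition in partitions:
        best_cost, best_proc = math.inf, None
        for processor in processors:
            allocation[partition] = processor      # Allocate Partition to Processor
            temp_cost = cost(allocation)
            if temp_cost < best_cost:
                best_cost, best_proc = temp_cost, processor
            del allocation[partition]              # Deallocate Partition from Processor
        allocation[partition] = best_proc          # committed; never revisited
    return allocation


def make_toy_cost(messages, distance, crowding_penalty=2):
    """Hypothetical cost: messages times distance, plus a crowding penalty."""
    def cost(allocation):
        comm = sum(count * distance[allocation[a]][allocation[b]]
                   for (a, b), count in messages.items()
                   if a in allocation and b in allocation)
        load = {}
        for proc in allocation.values():
            load[proc] = load.get(proc, 0) + 1
        crowding = sum(crowding_penalty * (k - 1) for k in load.values())
        return comm + crowding
    return cost


# Three processors on a line; distance is the hop count (invented example).
distance = [[abs(i - j) for j in range(3)] for i in range(3)]
messages = {("A", "B"): 1, ("B", "C"): 1}
result = greedy_allocation(["A", "B", "C"], range(3),
                           make_toy_cost(messages, distance))
print(result)  # {'A': 0, 'B': 1, 'C': 2}
```

Ties are broken in favor of the first processor examined because the comparison is a strict "less than", matching the test in figure 32.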

Another way to handle multiple copies is to decide the numbers of copies of
all partitions prior to using the procedure GreedyAllocation. The numbers can
be picked heuristically. One reasonable way to pick the number of copies of a
partition is to take the highest degree of parallelism exhibited by the partition.
The highest degree of parallelism is simply the maximum of the processor-load
function associated with the partition as described in chapter 3. Since the number
of copies cannot be any arbitrary number, and certainly not a fractional number,
the number of copies is picked arbitrarily to be the next higher acceptable number
greater than the maximum degree of parallelism. Call this modified version of
the procedure GreedyAllocation''. This method is less expensive than the first one.
Actual complexities of the two methods will be compared in the next section.
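The copy-count heuristic can be sketched as a small Python function. The set of acceptable copy counts is not given in this section, so the values used below are purely hypothetical, as are the function and parameter names. The sketch also assumes that a count equal to the peak is acceptable; the text can be read as requiring a strictly greater count.

```python
def pick_num_copies(load_profile, acceptable_counts):
    """Smallest acceptable copy count that covers the peak processor load.

    `load_profile` holds sampled values of the partition's processor-load
    function; `acceptable_counts` lists the copy counts the multiprocessor
    permits. If the peak exceeds every acceptable count, the largest is used.
    """
    peak = max(load_profile)
    for count in sorted(acceptable_counts):
        if count >= peak:
            return count
    return max(acceptable_counts)


# Hypothetical acceptable cluster sizes, for illustration only.
print(pick_num_copies([0.5, 3.2, 2.0], [1, 7, 19]))  # 7
```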

Notice that in the code for the procedure GreedyAllocation shown in figure 32,
no mention was made of the order in which partitions are chosen for allocation
out of the set SetOfPartitions. In practice, the order of allocation can affect the
allocation chosen by the procedure. The order that is used in this thesis is the
topological order associated with the dataflow* graph of the computation. If a
partition occurs multiple times in a topological search, its first instance is chosen
for the ordering. This order of allocation ensures that partitions are allocated only
after previously used partitions in the dataflow* graph have been allocated, thereby
giving the greedy allocation procedure some context in which to make reasonable
decisions. Prior to using the topological ordering, a random ordering was used and
discarded because it would make bad allocations for partitions that did not have
any communicating partitions allocated before them.
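The ordering rule can be sketched in Python with the standard-library `graphlib` module (Python 3.9 or later). The graph representation, the node names, and the mapping from subgoal nodes to partitions are assumptions made for illustration; the dataflow* graph of the thesis carries more information than this.

```python
from graphlib import TopologicalSorter


def partition_order(predecessors, partition_of):
    """Order partitions by their first appearance in a topological order.

    `predecessors` maps each subgoal node to the nodes it depends on, and
    `partition_of` maps each node to the partition it uses. Both names are
    hypothetical.
    """
    order, seen = [], set()
    for node in TopologicalSorter(predecessors).static_order():
        part = partition_of[node]
        if part not in seen:          # only the first instance counts
            seen.add(part)
            order.append(part)
    return order


# A chain of four subgoals in which partition P1 is used twice.
predecessors = {"g2": ["g1"], "g3": ["g2"], "g4": ["g3"]}
partition_of = {"g1": "P1", "g2": "P2", "g3": "P1", "g4": "P3"}
print(partition_order(predecessors, partition_of))  # ['P1', 'P2', 'P3']
```

When the graph admits several topological orders, `static_order` returns one of them, so the resulting partition order is one valid choice rather than a unique answer.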

Algorithm            Complexity
-------------------  ------------------------
GreedyAllocation     O(p^2 q + pqr^2)
GreedyAllocation'    O(p^2 q^2 + pq^3 r^2)
GreedyAllocation''   O(p^2 q + pq^2 r^2)

p = Number of partitions
q = Number of processors
r = Number of subgoals at compile-time

Table 4: Complexity Results for Greedy Allocation

In the special case when communication delays are assumed to be zero, there is
an even more effective order of allocation. In particular, Graham [29] has shown that
a particular order gives an upper bound on completion time of twice the optimal
completion time (asymptotically when the number of processors goes to infinity).
In this ordering, the next task chosen for execution at any time out of a DAG of
tasks is always the one that “heads the longest chain of unexecuted tasks (in the
sense that the sum of the task times in the chain is maximal).” Unfortunately, this
result does not apply to the case where communication delays are non-zero.
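The longest-chain priority used in the ordering attributed to Graham above can be sketched in Python. The task names, times, and graph below are invented, and the helper names are hypothetical. The sketch only computes the priority and picks among ready tasks; it is not a full scheduler.

```python
from functools import lru_cache


def chain_lengths(task_time, successors):
    """For each task, the total time of the longest chain that it heads."""
    @lru_cache(maxsize=None)
    def length(task):
        tails = (length(s) for s in successors.get(task, ()))
        return task_time[task] + max(tails, default=0)
    return {task: length(task) for task in task_time}


def next_task(ready, lengths):
    """Among the ready tasks, choose the one heading the longest chain."""
    return max(ready, key=lengths.__getitem__)


# Invented DAG: a precedes b and c, which both precede d.
task_time = {"a": 2, "b": 1, "c": 4, "d": 1}
successors = {"a": ["b", "c"], "b": ["d"], "c": ["d"]}
lengths = chain_lengths(task_time, successors)
print(lengths)                         # {'a': 7, 'b': 2, 'c': 5, 'd': 1}
print(next_task(["b", "c"], lengths))  # c
```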


4.1.3  Complexity 

The  complexity  results  for  greedy  allocation  are  summarized  in  table  4.  Further 
explanation  including  the  basis  of  the  results  is  given  below. 

Let p be the number of partitions, q the number of processors, and r the number
of subgoals in the dataflow* graph generated during abstract backward-chaining
deduction for the Processing Interval Assignment algorithm as well as the
Communication Estimation algorithm.

The time to update the communication cost function when a single partition is
allocated to a single processor is O(p) (see section 3.3.8). The time to update processor
multiplexing cost when a single partition is allocated is O(r^2) when only single copies
are allowed and O(qr^2) when multiple copies are allowed (see section 3.4.4.2). The
combined cost is O(p + r^2) for single copies and O(p + qr^2) for multiple copies.
Deallocation leads to the same cost and, therefore, the order of complexity for a
combined allocation and deallocation is the same as for an allocation alone.

The outer loop is executed p times, once for each partition. For GreedyAllocation
as well as GreedyAllocation'', the inner loop is executed q times. For
GreedyAllocation', the inner loop is executed q^2 times since the cardinality of
PotentialProcs in the worst case (q) multiplied by the cardinality of AllProcs (q) is
q^2. Therefore, GreedyAllocation and GreedyAllocation'' require pq updates due to
allocations and GreedyAllocation' requires up to pq^2 updates due to allocations.
Multiplying the number of updates by the complexity of each update gives the
complexity of the entire algorithm. Therefore, the complexity of GreedyAllocation
is O(pq × (p + r^2)), which is O(p^2 q + pqr^2). Similarly, the complexity of
GreedyAllocation'' is O(pq × (p + qr^2)), which is O(p^2 q + pq^2 r^2). Finally, the
complexity of GreedyAllocation' is O(pq^2 × (p + qr^2)), which is O(p^2 q^2 + pq^3 r^2).

An  optimization  is  possible  for  the  greedy  allocation  procedures  that  can  reduce 
the  absolute  cost  of  the  procedures  but  does  not  affect  the  worst  case  complexity 
measures  derived  above.  PotentialProcs  in  the  procedures  need  not  be  AllProcs. 
In  the  single  copy  case,  for  example,  allocations  need  to  be  considered  only  to 
processors  that  already  have  partitions  allocated  to  them  or  their  neighbors.  As 
a  special  case,  the  first  partition  should  be  allocated  immediately  to  the  processor 
where  the  computation  will  begin  (which  is  assumed  to  be  the  same  as  the  processor 
where  the  final  result  will  be  demanded).  When  partitions  can  have  multiple  copies, 
this gets a bit more involved but the general idea is the same. Notice that this
optimization does not reduce the size of PotentialProcs in the worst case, which is q.

4.1.4 Not Locally Optimal

Once allocated to a processor, a partition is not re-allocated to another processor
when allocations of other partitions are being considered. This is done regardless of
any new communication requirements that the later partitions may expose.
Therefore, it is not surprising that greedy allocation is not guaranteed to produce a locally
optimal allocation. An example of greedy allocation that does not produce a locally
optimal solution is given in appendix C. Of course, if the solution is not locally
optimal, it is also not globally optimal.


4.2  Local  Minimization 

This  section  contains  a  specification  of  the  algorithm,  a  description  of  the  algorithm, 
a  discussion  of  its  complexity,  and  an  example  to  show  that  the  allocations  produced 
are  not  necessarily  globally  optimal. 


4.2.1  Specifications 

Inputs 

1.  P:  a  set  of  partitions  of  the  database. 

2. C: a function that takes two partitions P1 and P2 and returns a tuple of
the form <data, number> where data is the amount of data (in bytes) and
number is the number of messages sent from partition P1 to partition P2. data
and number are expected values in a probabilistic sense.

3.  PI:  a  function  that  takes  a  partition  and  returns  the  set  of  processing  intervals 
associated  with  the  partition. 

4. Multiprocessor constants: These are K1, K2, and K3 used to compute
communication cost as given by equations 5 and 6.

5. Topology: This includes (1) distances between all pairs of processors and (2)
lists of neighbors of each processor.

6.  An  allocation:  A  many-to-one  mapping  from  the  set  of  partitions  to  the  set 
of  processors. 

7.  Number  of  copies  for  each  partition 


Outputs 

1.  Allocation:  A  many-to-one  mapping  from  the  set  of  partitions  to  the  set  of 
processors. 


4.2.2  Algorithm 

Notice from the specifications given above that the number of copies for each
partition is already fixed by the greedy allocation procedure. In fact, the number of
copies is an input to the procedure.

The code for the local minimization procedure LocalMinimization is given in
figure 33. The idea is that there is a set of iterations specified by the outer While
loop. In each iteration, every partition is considered in turn (by the outer one of
the two nested For loops). The best allocation is picked for each partition among
the processor it is currently allocated to and its neighbors (six in the case
of FAIM-1). This is done in the inner For loop. At the conclusion of the inner For
loop, the partition is allocated to the best processor among the ones considered.
If this is different from the processor that the partition was allocated to, then the
boolean variable Changed? is set to true. Therefore, Changed? gets set to true
if one or more partitions get reallocated to a neighboring processor. The While
loop terminates when Changed? is false, or equivalently when no partitions were
reallocated in the previous iteration of the While loop.

Procedure LocalMinimization()
begin
    Changed? ← true;
    While Changed? = true do begin
        Changed? ← false;
        For all Partition ∈ SetOfPartitions do begin
            CurrProc ← Processor(Partition);
            BestCost ← Cost(Allocation);
            BestProc ← CurrProc;
            Deallocate Partition from CurrProc;
            For all Processor ∈ Neighbor(CurrProc) do begin
                Allocate Partition to Processor;
                TempCost ← Cost(Allocation);
                If TempCost < BestCost then begin
                    BestCost ← TempCost;
                    BestProc ← Processor
                end; /* If */
                Deallocate Partition from Processor
            end; /* For */
            /* The partition is re-allocated even when BestProc = CurrProc */
            Allocate Partition to BestProc;
            If BestProc ≠ CurrProc then
                Changed? ← true
        end /* For */
    end /* While */
end; /* LocalMinimization */
Figure 33: Procedure LocalMinimization

                                Complexity
Algorithm                       Single Copies            Multiple Copies
------------------------------  -----------------------  -----------------------
LocalMinimization               O(q^p (p^2 + pr^2))      O(q^p (p^2 + pqr^2))
One iteration of While loop     O(p^2 + pr^2)            O(p^2 + pqr^2)
in LocalMinimization

p = Number of partitions
q = Number of processors
r = Number of subgoals at compile-time

Table 5: Complexity Results for Local Minimization
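The procedure LocalMinimization of figure 33 can be sketched in Python. This is an illustration under stated assumptions: the function name, the dictionary representation of an allocation, the caller-supplied `cost` function, and the optional `max_rounds` cut-off are all hypothetical. The cost is recomputed from scratch for each trial move, whereas the procedure described in the text updates it incrementally.

```python
def local_minimization(allocation, neighbors, cost, max_rounds=None):
    """Repeat cost-reducing moves of partitions to neighboring processors.

    `allocation` maps partitions to processors, `neighbors` maps a processor
    to its adjacent processors, and `cost` scores a complete allocation.
    `max_rounds` (an addition for illustration) bounds the number of
    iterations of the outer loop; a complete allocation is returned either way.
    """
    allocation = dict(allocation)          # leave the caller's mapping untouched
    rounds = 0
    changed = True
    while changed and (max_rounds is None or rounds < max_rounds):
        changed = False
        rounds += 1
        for partition in list(allocation):
            curr_proc = allocation[partition]
            best_cost, best_proc = cost(allocation), curr_proc
            for processor in neighbors[curr_proc]:
                allocation[partition] = processor      # trial allocation
                temp_cost = cost(allocation)
                if temp_cost < best_cost:
                    best_cost, best_proc = temp_cost, processor
            allocation[partition] = best_proc          # may equal curr_proc
            if best_proc != curr_proc:
                changed = True
    return allocation
```

Because a move is accepted only when it strictly lowers the cost, and the number of allocations is finite, the loop terminates even without `max_rounds`.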

4.2.3  Complexity 

Table 5 summarizes the complexity results for LocalMinimization. Further
explanation including the basis for the results is given below.

In the worst case, LocalMinimization may consider all possible allocations. These
are exponential in number. To be precise, there are q^p allocations, where p is the
number of partitions and q is the number of processors. Notice that after each
iteration of the While loop, there is always a complete allocation that is the lowest
cost allocation found so far. As it turns out, each iteration takes polynomial time
(see below). Therefore, if the algorithm has exceeded some time limit, it can be
terminated between iterations of the While loop and the latest allocation can be
used.

The time taken for each iteration of the While loop can be analyzed as follows.
Each partition is allocated (and deallocated) 7 times in each iteration of the While
loop (six trial allocations to neighbors in the case of FAIM-1, plus the final
allocation). The cost for updating the cost function for each allocation/deallocation is
O(p + r^2) when single copies of partitions are used and it is O(p + qr^2) when
multiple copies of partitions are allowed (see previous discussion on complexity
of GreedyAllocation). Since there are p partitions, the total times taken for an
iteration are O(p^2 + pr^2) and O(p^2 + pqr^2) for the single copy and multiple copy
cases respectively.
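Written out as a product, the per-iteration bounds and the worst-case totals of table 5 follow directly. This sketch assumes, as the table does, that the While loop runs at most q^p times, once per possible allocation.

```latex
% One iteration: p partitions, 7 allocations each, constant factors dropped.
\[
T_{\text{iter}}^{\text{single}} = 7p \cdot O(p + r^2) = O(p^2 + p r^2),
\qquad
T_{\text{iter}}^{\text{multi}} = 7p \cdot O(p + q r^2) = O(p^2 + p q r^2)
\]
% Worst case: at most q^p iterations of the While loop.
\[
T^{\text{single}} = O\!\left(q^p (p^2 + p r^2)\right),
\qquad
T^{\text{multi}} = O\!\left(q^p (p^2 + p q r^2)\right)
\]
```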

4.2.4  Not  Globally  Optimal 

Even  if  the  procedure  LocalMinimization  is  executed  till  it  terminates  (as  opposed  to 
just  a  few  rounds),  there  is  no  guarantee  that  the  locally  optimal  allocation  is  going 
to  be  globally  optimal  as  well.  Appendix  D  contains  an  example  of  an  allocation 
produced  by  LocalMinimization  that  is  not  globally  optimal. 


4.3  Experimental  Results 

PM, the parallel execution model, and the resource allocation algorithms have been
implemented in Zetalisp on the Symbolics 3600 series of Lisp Machines [44].1 PM
and the simulated version of PM were implemented on top of a high-level functional
simulation of FAIM-1 using the event-driven simulator Helios [24]. The parallel
interpreters were created by modifying the sequential backward-chaining interpreter
in MRS [54], a logic programming system.

Several examples have been tried using this implementation. One of these will
be described in detail to demonstrate the utility of PM and the resource allocation
techniques developed in this thesis. The example logic program describes the
structure and behavior of a digital device, a 4-bit adder. In addition, a set of facts
describes the values of all the inputs. The goal given to the backward-chaining
deduction engine is to determine the value of a particular output. This problem
is similar, but not identical, to a part of the problem of test-generation [59]: the
determination of values for a set of inputs that would force an output (or some
other intermediate port) to a particular value.

1 Zetalisp and Symbolics are trademarks of Symbolics, Inc.

Detailed information about the example is given in appendix E. In particular,
the appendix contains the complete database for the example, the goal given to
the backward-chaining engine, the partitioning of the database, the FAIM-1
multiprocessor configuration used, other multiprocessor parameters, and finally the
allocations generated by the allocator. Two allocations are shown: the first for the
single copy case and the second for the multiple copy case.

Figure 34 shows the parallelism profile for the application. The profile gives
the number of parallel inferences versus time assuming unbounded processors and
memory, and instantaneous communication. The figure shows two curves: the
curve marked “AOP” shows the profile when and-parallelism, or-parallelism, and
pipelining are exploited and the curve marked “OP” shows the profile when only
or-parallelism and pipelining are exploited. The average and maximum parallelism
for the “AOP” case are 30.371 and 106 respectively. The same numbers for the
“OP” case are 12.745 and 37 respectively. The numbers demonstrate the advantage
of exploiting and-parallelism.

The  same  curves  also  give  unreachable  lower  bounds  on  the  time  to  complete  the 
computation.  The  lower  bound  is  simply  the  maximum  time  value  for  the  curve. 
In  any  real  multiprocessor,  the  completion  time  will  be  greater  than  this  lower 
bound  because  it  will  have  only  a  limited  number  of  processors  (as  opposed  to  an 
unlimited  number  assumed  here)  and  non-zero  communication  delays  (as  opposed 
to  instantaneous  communication  assumed  here).  The  lower  bound  for  the  “AOP” 
case  is  35  logical  inference  time  units  and  the  lower  bound  for  the  “OP”  case  is 
51  logical  inference  time  units.  Again,  these  numbers  indicate  the  advantage  of 
exploiting  and-parallelism. 

Figure 34: Parallelism Profile for Adder Example

The curves also give the sequential time for computation. The sequential time
is simply the area under the curve. The sequential time for the “AOP” case is
1063 logical inference time units and the sequential time for the “OP” case is 650
logical inference time units. Notice that the sequential time for the “OP” case is
lower than the sequential time for the “AOP” case. Therefore, if only one processor
is available, it is more efficient to exploit less parallelism conceptually. Of course,
it could have been the other way around also, as pointed out in chapter 2. The
corresponding unreachable upper bounds on speedups for “AOP” and “OP” can
be computed by dividing the sequential time by the unreachable lower bound on
time taken in the parallel case. These upper bounds are 1063/35 (= 30.371) and
650/51 (= 12.745) for “AOP” and “OP” respectively. Notice that these unreachable
speedup numbers are the same as the average parallelism numbers given earlier (as
they should be).
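The relationship between the profile, the sequential time, and the speedup bound can be checked with a few lines of Python. The function names and the toy profile are invented for illustration; the adder figures are the ones quoted in the text.

```python
def profile_summary(profile):
    """Return (sequential time, parallel time, average parallelism).

    `profile[t]` is the number of inferences performed in time unit t.
    """
    sequential_time = sum(profile)    # area under the curve
    parallel_time = len(profile)      # time at which the curve ends
    return sequential_time, parallel_time, sequential_time / parallel_time


def speedup_bound(sequential_time, parallel_time):
    """Unreachable upper bound on speedup: sequential over ideal parallel time."""
    return sequential_time / parallel_time


# Toy profile, invented for illustration.
print(profile_summary([1, 3, 4, 2]))        # (10, 4, 2.5)

# Figures quoted in the text for the adder example.
print(round(speedup_bound(1063, 35), 3))    # 30.371 (AOP)
print(round(speedup_bound(650, 51), 3))     # 12.745 (OP)
```

The bound and the average parallelism are the same quotient (area divided by duration), which is why the two sets of numbers coincide.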

Earlier  experiments  with  smaller  examples  had  indicated  that  greedy  allocation 
by  itself  either  produced  locally  optimal  allocations  or  allocations  that  were  very 
close  to  locally  optimal.  The  experiments  described  for  the  adder  example  use 
greedy  allocation  only;  no  local  minimization  was  used. 

A possible explanation for greedy allocation turning out to be so successful
is given now. The only hand-designed situations where greedy allocation
performs poorly are cases where the communication and processing requirements for a
dataflow* graph are highly non-uniform (see appendix C for an example). In the
practical examples looked at, this was not the case (i.e., processing and
communication requirements were fairly uniform). In particular, for the adder example
being considered here, all communication arcs in a conjunct graph carry a single
message, if they carry one at all. This follows directly from the fact that the output
of a hardware component is a function of the inputs. In addition, the amount of
processing associated with nodes in a conjunct graph is fairly uniform. The number
of rules that apply to reducing any particular goal ranges from two to four only.

When Greedy Allocation was used to make a single-copy allocation, the time
taken and speedup were found to be 215.531 logical inference time units and 4.932
respectively.

While  using  the  single-copy  allocation  generated  by  the  allocator  program,  it  was 
noticed  that  certain  partitions  were  bottlenecks  in  the  computation.  The  first  clue 
came  from  monitoring  the  “busy-ness”  of  various  processors  during  the  parallel 
computation.2 The second clue came from looking at the parallelism profile of
single partitions. Some partitions came out with very high parallelism for some
time intervals, indicating that they might be bottlenecks.

2 This was done by using a color instrumentation tool in Helios, the event-driven simulator. A
color spectrum from blue to red was used to indicate the “busy-ness” of processors represented by
icons, with red being used to indicate the busy extreme and blue being used to indicate the idle
extreme.

Given the evidence of a bottleneck due to single copies, it was decided to allow
multiple copies in the allocation. The procedure Greedy Allocation" was used alone
without any local minimization. If local minimization were used, it would only
improve on this allocation. It turns out that the time taken and speedup for the
allocation generated were 60.254 logical inference time units and 17.642 respectively.
Compared to the single-copy case, the speedup is higher by a factor of 3.577.

A random allocator was used to generate an allocation using the same number
of copies for each partition as that used by Greedy Allocation". Time taken and
speedup were 62.47 (= 1063/17.015) logical inference time units and 17.015 respectively.

Greedy Allocation is not much better than a random allocator in this case because
communication is relatively cheap in the FAIM-1 multiprocessor configuration
considered. However, there are at least two cases where a random allocation can
perform arbitrarily worse than a greedy allocation. Both of these cases have
the characteristic that the average delays in the random allocation case are arbitrarily
larger than the average delays expected in the greedy allocation case. The
first case is one in which there are a larger number of processors. A larger number
of processors increases the average distance between a random pair of processors.
This increases the expected distance for communication using a random allocation.
However, greedy allocation does not use more processors unless that decreases the
cost function. In other words, adding more processors does not necessarily mean
that they will be used by greedy allocation. The second case is one in which different
communication hardware is used and the communication delay is higher even for the
same distances as before. This could happen if a different multiprocessor were used
that did not have a high degree of hardware support for communication (as FAIM-1
does).

Now, it remains to be seen what the effect of higher delays is on random allocations.
Figure 35 illustrates this effect. The figure plots speedup versus log (base
2) of delay (expressed as a multiple of the normal delay expected for the FAIM-1
configuration) for a set of experiments performed using the random allocation mentioned
above. Delays to the left of the speedup axis are sub-normal delays (down
to $2^{-10}$ times the normal delay) and delays to the right are super-normal delays
(up to $2^{10}$ times the normal delay). The relative flatness of the curve to the left
of  the  speedup  axis  demonstrates  that  communication  is  not  a  bottleneck  in  this 
case.  However,  as  communication  delays  are  increased  beyond  the  normal  delays, 
the  speedup  for  random  allocation  drops  to  zero  asymptotically.  Let  us  see  how 
a  greedy  allocation  might  perform  in  the  two  cases  in  which  delays  are  increased. 
When the number of processors is increased, the speedup expected from greedy allocation
should be as good as or better than 17.642 (the speedup for the configuration
used for the greedy allocation experiment mentioned earlier). For the random allocation
case, a delay that is 4 times normal drops speedup to about 12.5. Given the
topology  of  FAIM-1  and  the  multiprocessor  communication  constants,  it  turns  out 
that  this  delay  would  be  expected  when  the  number  of  processors  is  increased  to 
about 4000. Let us look at the other case now. If communication delays are higher
overall for the multiprocessor, then communication cost will overwhelm processor
multiplexing cost beyond a certain point. Therefore, all computation will get allocated
to a single processor and speedup will be 1. In the random allocation case,
however,  it  could  be  arbitrarily  close  to  zero.  As  a  somewhat  less  extreme  case,  a 
delay  of  128  times  the  normal  FAIM-1  delay  drops  the  speedup  below  1  (see  figure). 
This  can  easily  happen  if  the  multiprocessor  does  not  have  the  type  of  specialized 
communication  support  that  FAIM-1  has. 
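
The first case (more processors means a larger expected distance for a random allocation) can be illustrated with a small sketch. It assumes a square grid of processors with Manhattan (hop-count) distance purely for illustration; this is not claimed to be the FAIM-1 topology, and the function name is hypothetical.

```python
from itertools import product


def mean_pair_distance(side):
    """Mean Manhattan distance between two processors chosen uniformly and
    independently on a side x side grid (hypothetical topology)."""
    cells = list(product(range(side), repeat=2))
    total = sum(abs(ax - bx) + abs(ay - by)
                for (ax, ay) in cells for (bx, by) in cells)
    return total / len(cells) ** 2


# The mean distance keeps growing as processors are added.
for side in (2, 4, 8, 16):
    print(side * side, "processors:", round(mean_pair_distance(side), 3))
```

Under this assumption the mean distance grows roughly in proportion to the side of the grid, so a random allocation pays more for communication as the machine grows, whereas an allocator that is free to ignore the extra processors does not.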


Figure 35: Speedup vs. Delay for Random Allocation

On a different note, it was mentioned in section 3.1.4 that a possible improvement
in the communication cost might be to reduce it by the degree of communication
parallelism. There is some evidence that this might be true. A reasonable measure
of the degree of communication parallelism for a computation might be the average
parallelism given by its parallelism profile (assuming that the degree of communication
parallelism is the same as the degree of processing parallelism). In the case
of the adder example, this is 30.371. An allocation was produced by reducing the
communication parameters (K1, K2, and K3 in equation 1, section 3.1.4) by a factor
of 32 (closest power of 2 to 30.371). This allocation, produced by the procedure
Greedy Allocation", gives an average speedup of 18.250 as opposed to 17.642 with the
normal communication parameters. The speedup did improve by taking communication
parallelism into consideration. However, this single data point should be
considered as suggestive evidence only. Conclusive proof can only be provided by
further research. Of course, there may be more accurate methods to take communication
parallelism into account and the associated speedup improvement may be
even greater.
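
The scaling step can be sketched as follows. The helper names are hypothetical, and the parameter values are placeholders, since the actual values of K1, K2, and K3 are not given in this section.

```python
import math


def closest_power_of_two(x):
    """Return the power of two nearest to x (x > 0)."""
    lower = 2 ** math.floor(math.log2(x))
    upper = 2 * lower
    return lower if x - lower <= upper - x else upper


def scale_parameters(params, factor):
    """Divide every communication parameter by the same factor."""
    return {name: value / factor for name, value in params.items()}


average_parallelism = 30.371            # from the parallelism profile
factor = closest_power_of_two(average_parallelism)

# Placeholder values: the real K1, K2, K3 are not given in this section.
params = {"K1": 64.0, "K2": 32.0, "K3": 16.0}
scaled = scale_parameters(params, factor)
print(factor, scaled)
```

The scaled parameters would then be handed to the allocator in place of the normal ones.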


4.4  Related  Work 

4.4.1  Theoretical  work 

Previous  theoretical  work  on  scheduling  (or  allocation)  for  multiprocessors  [37,39]  is 
not directly applicable here. There are many variations on the scheduling problem,
but none of them includes communication cost in a general way. There are many
interesting  results,  however,  that  may  be  good  starting  points  for  extensions  that 
consider  communication.  Extensions  to  approximation  results  such  as  Graham’s 
[29]  would  be  tremendously  useful.  Another  extension  that  would  be  required  to 
attack  the  scheduling  problem  in  this  thesis  would  be  the  inclusion  of  memory 
constraints  that  limit  the  number  of  copies  of  certain  pieces  of  the  database  (or 
code  in  procedural  languages). 


4.4.2  Local  Search 

The  local  minimization  algorithm  discussed  in  this  chapter  is  an  application  of  a 
general technique called Local Search in the optimization literature (see the book by
Papadimitriou and Steiglitz [51], for example). That book describes the general
algorithm as follows:

Given an instance $(F, c)$ of an optimization problem, where $F$ is the feasible set
and $c$ is the cost mapping, we choose a neighborhood

$$N : F \to 2^{F}$$

which is searched at point $t \in F$ for improvements by the subroutine

$$\mathrm{improve}(t) = \begin{cases} \text{any } s \in N(t) \text{ with } c(s) < c(t) & \text{if such an } s \text{ exists} \\ \text{“no”} & \text{otherwise} \end{cases}$$

The  book  contains  many  examples  of  local  search  algorithms  applied  to  the 
travelling  salesman  problem  and  the  uniform  graph  partitioning  problem  among 
others.  In  addition,  the  book  identifies  some  general  issues  in  the  development  of 
such algorithms. In many cases, local search has turned out to be a powerful optimization
technique and is often the best available. Unfortunately, the development
of  local  search  algorithms  remains  largely  an  art  and  the  demonstrations  of  utility 
are  empirical  in  nature. 
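
A minimal sketch of the general scheme quoted above follows. The names `improve` and `local_search` mirror the definition; the toy instance (integers 0 to 20 with a quadratic cost) is hypothetical and chosen only to make the sketch runnable.

```python
def improve(t, neighborhood, cost):
    """Return some s in N(t) with cost(s) < cost(t), or None ("no")."""
    for s in neighborhood(t):
        if cost(s) < cost(t):
            return s
    return None


def local_search(start, neighborhood, cost):
    """Apply improve repeatedly until a local optimum is reached."""
    t = start
    while True:
        s = improve(t, neighborhood, cost)
        if s is None:
            return t
        t = s


# Toy instance (hypothetical): minimize (x - 7)^2 over the integers 0..20,
# where the neighbors of x are x - 1 and x + 1.
def neighbors(x):
    return [y for y in (x - 1, x + 1) if 0 <= y <= 20]


def cost(x):
    return (x - 7) ** 2


result = local_search(0, neighbors, cost)
print(result)
```

The search stops at the first point for which no neighbor is cheaper, which is a local optimum and not necessarily a global one.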

Recall  that  in  this  thesis,  the  local  minimization  algorithm  turned  out  not  to  be 
very  important.  The  starting  point  for  local  minimization  (i.e.,  the  result  of  greedy 
allocation)  was  already  quite  good. 

4.4.3  Compile-time  Allocation  for  Dataflow 

4.4.3.1 DDM2 from University of Utah

A  paper  by  Martha  Chamberlain  and  Alan  Davis  [11]  describes  what  was  probably 
the  first  attempt  at  static  allocation  of  dataflow  programs.  The  target  machine  was 
called  DDM2  (a  successor  to  DDM1)  and  a  single  processor  version  was  operational 
in  1979.  Timing  measurements  taken  from  the  single  processor  version  were  then 
used  to  emulate  a  multiple  processor  version  whose  topology  was  a  tree. 

The input to the allocator is a type of dataflow graph called DDN (Data Driven
Net). The overall goal of the allocator was to massage this graph into a tree-structured
shape, preserving as much of the locality as possible. Function-preserving
graph  transformations  such  as  replicating  nodes  and  inserting  dummy  nodes  for 
extra  synchronization  were  used. 

The overall structure of the allocator consists of three top-level steps. First,
the DDN is converted to a TANTA graph (or Two-terminal, Acyclic graph with
No Transitive Arcs). Since DDNs are already two-terminal, this phase consists of
encapsulating cyclic iteration structures into single complex nodes and removing
transitive arcs. The second top-level step is the conversion of TANTA graphs to SP
graphs (or series-parallel graphs). Different methods to do this lead to minimum
work or minimum time (i.e., minimum critical path). The third and final step is to
convert the SP graph to a tree by a series of folding operations.
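
The removal of transitive arcs in the first step can be sketched as below. This is a generic transitive reduction of an acyclic graph, not the DDM2 allocator's own code; the graph representation (a dictionary of successor lists) and the function names are assumptions.

```python
def reachable(graph, start, goal, skip_arc):
    """Depth-first search from start to goal that ignores one arc."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node == goal:
            return True
        if node in seen:
            continue
        seen.add(node)
        for succ in graph.get(node, ()):
            if (node, succ) != skip_arc:
                stack.append(succ)
    return False


def remove_transitive_arcs(graph):
    """Return a copy of an acyclic graph without its transitive arcs.

    An arc (u, v) is transitive if v can still be reached from u
    after that arc is ignored."""
    reduced = {}
    for u, succs in graph.items():
        reduced[u] = [v for v in succs
                      if not reachable(graph, u, v, skip_arc=(u, v))]
    return reduced


# Hypothetical two-terminal graph with three transitive arcs.
graph = {"in": ["a", "b", "out"], "a": ["b", "out"], "b": ["out"], "out": []}
reduced = remove_transitive_arcs(graph)
print(reduced)
```

In this example only the chain in, a, b, out survives, because every other arc is implied by a longer path.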

In  comparison  with  the  allocator  presented  in  this  thesis,  a  lot  of  processing  in 
the  DDM2  allocator  is  geared  specifically  towards  the  special-purpose  tree  topology. 
The  allocator  in  this  thesis  is  not  designed  for  any  particular  topology.  Another 
point  of  difference  is  that  the  DDM2  allocator  makes  the  simple  assumption  of 
equal  computation  cost  for  all  nodes  and  single  token  communication  along  all  arcs. 
A  considerable  amount  of  theory  was  developed  in  this  thesis  (in  chapter  3)  to 
generate more accurate predictive models of communication and processing. Another
difference is in the area of exploiting the tradeoff between parallelism and
communication cost. The allocator in this thesis attempts to make this tradeoff
systematically based on the separate communication cost and processor multiplexing
cost components of the cost function. Program fragments that produce large
amounts of communication delay relative to the amount of parallelism exposed are
allocated to the same processor. In the extreme, the entire program may get allocated
to the same processor even when more processors are available. The DDM2
allocator will expose all concurrency if there are sufficient numbers of processors
available.

4.4.3.2 Hughes Dataflow Multiprocessor

Michael  Campbell  [10]  describes  another  method  for  the  compile-time  allocation  of 
dataflow  programs  to  the  Hughes  Dataflow  Multiprocessor.  The  multiprocessor  has 
a  bussed  cube  interconnection  network.  However,  the  allocation  algorithms  are  not 
designed  to  work  with  just  that  topology. 

Allocation is based on a heuristic cost function that is a weighted sum of a
communication cost and a processing cost. Communication cost associated with the
allocation  of  a  single  node  in  the  dataflow  graph  is  the  sum  of  the  distances  of  arcs 
connected  with  the  node;  distance  is  simply  the  number  of  hops  from  the  processor 
associated  with  the  source  node  of  an  arc  to  the  processor  associated  with  the 
destination  node  of  the  arc.  No  consideration  is  given  to  the  size  of  the  data  in 
each  token  transmitted  along  an  arc  or  the  number  of  tokens.  The  processing  cost 
is  computed  by  first  finding  the  transitive  closure  of  the  graph.  Potentially  parallel 
nodes  are  those  that  do  not  have  an  arc  connecting  them  in  the  transitive  closure. 
The  processing  cost  associated  with  the  allocation  of  a  certain  node  to  a  processor 
is  computed  from  the  number  of  potentially  parallel  nodes  allocated  to  the  same 
processor.  Each  node  is  assumed  to  take  the  same  computation  time  and  no  special 
consideration  is  given  to  the  multiple  invocation  of  a  node. 
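
A rough sketch of a cost function of this kind is given below. It follows the description above only loosely: the processing cost here simply counts potentially parallel pairs that share a processor, the cost is summed over the whole graph rather than per node, and the weights, the `hops` function, and all names are assumptions.

```python
from itertools import combinations


def transitive_closure(arcs, nodes):
    """Set of (u, v) pairs such that v is reachable from u."""
    closure = set(arcs)
    for k in nodes:                       # Warshall's algorithm
        for i in nodes:
            for j in nodes:
                if (i, k) in closure and (k, j) in closure:
                    closure.add((i, j))
    return closure


def potentially_parallel(arcs, nodes):
    """Unordered node pairs with no connecting arc in the closure."""
    closure = transitive_closure(arcs, nodes)
    return {frozenset((u, v)) for u, v in combinations(nodes, 2)
            if (u, v) not in closure and (v, u) not in closure}


def heuristic_cost(arcs, nodes, placement, hops, w_comm=1.0, w_proc=1.0):
    """Weighted sum of hop-count communication cost and a processing cost
    that counts potentially parallel pairs sharing a processor."""
    comm = sum(hops(placement[u], placement[v]) for u, v in arcs)
    proc = sum(1 for pair in potentially_parallel(arcs, nodes)
               if len({placement[n] for n in pair}) == 1)
    return w_comm * comm + w_proc * proc


# Diamond-shaped graph; processors are numbered along a line (hypothetical).
nodes = ["a", "b", "c", "d"]
arcs = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d")]
hops = lambda p, q: abs(p - q)

together = heuristic_cost(arcs, nodes, {"a": 0, "b": 0, "c": 0, "d": 0}, hops)
split = heuristic_cost(arcs, nodes, {"a": 0, "b": 0, "c": 1, "d": 0}, hops)
print(together, split)
```

With equal weights, keeping the diamond on one processor costs less than splitting it; a larger processing weight would tip the balance the other way.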

The differences from this thesis are the following: (1) Campbell assumes a much
simpler model of communication. (2) He also assumes a much simpler model of processing.
(3) Potentially parallel computations are found by computing the transitive closure
of the graph, whereas the allocator in this thesis performs an abstract simulation with
probabilistic analysis to find parallel computations. (4) A node may be allocated
to a single processor only, whereas we allow multiple copies.

4.4.3.3 Vivek Sarkar’s thesis

In his thesis Partitioning and Scheduling Parallel Programs for Execution on Multiprocessors
[55], Vivek Sarkar describes another approach to compile-time allocation
for dataflow programs. This approach is interesting because it takes completion
time as the cost function as opposed to a combination of communication and processing.
Some differences from this thesis are described below. First, it assumes
that  each  processor  has  sufficient  memory  to  execute  the  entire  program  unlike  the 
approach  in  this  thesis.  Second,  profile  information  is  used  for  estimates  as  opposed 
to  probabilistic  estimates  in  this  thesis.  Finally,  it  is  claimed  that  the  approach  is 
applicable  to  topologies  in  which  there  could  be  delays  that  are  a  function  of  the 
distance  between  processors.  However,  all  experiments  reported  assume  delays  that 
are  independent  of  distance. 

4.4.4 Kemal Oflazer’s Thesis on Partitioning of Production Systems

Kemal Oflazer discusses the partitioning problem for Production Systems (or Rule-Based
Systems), specifically OPS5 [23], in his thesis Partitioning in Parallel Processing
of Production Systems [50] and an earlier paper [49]. This partitioning problem
is  described  as  the  compile-time  allocation  of  productions  (or  rules)  to  processors 
in  such  a  way  that  the  total  time  of  execution  is  minimized. 

A  production  system  interpreter  repeatedly  executes  a  recognize-act  cycle.  This 
cycle consists of 3 phases: Match, Conflict-Resolution, and Act. The Match phase
finds all productions that may be fired, the Conflict-Resolution phase picks a single
production  to  be  fired,  and  the  Act  phase  performs  the  changes  to  the  database 
mandated  by  the  chosen  production.  Note  that  the  Conflict-Resolution  phase  is  a 
synchronization  point  during  every  cycle. 

In Oflazer’s parallel processor organization for partitioning, a set of processors
contains mutually exclusive and exhaustive subsets of the productions in the system.
Each processor also contains the state associated with its subset of the productions.
The goal of each processor is to make any changes to its state mandated
by the previous Act phase, find the matching productions, and report them to some
central processing location. The central processor performs the Conflict-Resolution
phase and communicates the state changes mandated by the chosen production to the
relevant processors. Since most of the processing in production systems takes place
during the Match phase, Oflazer’s model ignores the processing cost during the
Conflict-Resolution phase and the Act phase. In addition, communication cost between
the parallel processors and the central processing location is ignored because
only a small amount of data is involved.
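
The organization described above can be sketched as a toy recognize-act loop over partitioned productions. This is a simplification and not Oflazer's or OPS5's actual mechanism: productions are plain dictionaries, a single shared working memory stands in for the per-processor state, each production fires at most once, and conflict resolution by name is an arbitrary assumption.

```python
def match(productions, memory):
    """Match phase for one processor: productions whose conditions hold."""
    return [p for p in productions if p["if"] <= memory]


def recognize_act(partitions, memory, max_cycles=10):
    """Run recognize-act cycles over productions partitioned by processor."""
    memory = set(memory)
    fired = []
    for _ in range(max_cycles):
        # Match: every processor reports matches from its own subset.
        conflict_set = [p for part in partitions for p in match(part, memory)
                        if p["name"] not in fired]
        if not conflict_set:
            break
        # Conflict-Resolution (central): pick a single production.
        chosen = min(conflict_set, key=lambda p: p["name"])
        # Act: apply the changes mandated by the chosen production.
        memory |= chosen["add"]
        fired.append(chosen["name"])
    return memory, fired


# Two processors, each holding a disjoint subset of the productions.
partitions = [
    [{"name": "r1", "if": {"a"}, "add": {"b"}}],
    [{"name": "r2", "if": {"b"}, "add": {"c"}},
     {"name": "r3", "if": {"z"}, "add": {"y"}}],
]
final_memory, fired = recognize_act(partitions, {"a"})
print(sorted(final_memory), fired)
```

The central selection step is the per-cycle synchronization point: no processor can start the next Match phase until the chosen production is known.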

This  work  is  different  from  our  model  in  the  following  ways.  First,  the  presence 
of  a  synchronization  point  during  every  interpreter  cycle  makes  it  a  very  different 
type of computation. There are no such synchronization points in dataflow graphs.
Second, communication cost is not a factor in Oflazer’s work whereas it is a central
focus of the work in this thesis. Third, Oflazer takes the estimates for processing
costs from previous executions of the same production system. In our case, estimates
are produced by probabilistic analysis.

4.4.5  Compile-time  Allocation  of  Actor  Languages 

Bill Athas has recently completed a thesis on compile-time allocation for a concurrent,
object-oriented programming language called Cantor [4]. Unfortunately, the
thesis was not available in time to make a detailed comparison.

4.4.6  Run-time  Allocation 

A  lot  of  research  has  been  done  in  the  area  of  run-time  allocation  for  many  different 
types  of  computations.  It  is  not  possible  to  discuss  all  the  work  here  but  some 
interesting  pieces  of  work  are  mentioned  below.  As  mentioned  earlier  in  chapter  1, 
run-time  allocation  has  the  disadvantage  that  the  overhead  of  decision-making  must 
be  paid  at  run-time.  However,  if  the  behavior  of  the  program  is  highly  dynamic 
and  is  hard  to  predict  at  compile-time,  then  run-time  allocation  may  be  the  best 
approach. 

Smith  [66]  has  presented  a  protocol  called  Contract  Net  to  dynamically  distribute 
tasks  among  processors  in  a  distributed  system.  Each  task  is  distributed  using 
an Announcement-Bid-Award sequence. A task to be distributed is announced as
being  available,  processors  may  bid  to  do  the  task,  and  the  announcing  processor 
may  then  award  the  contract  to  one  of  the  processors.  The  idea  was  to  propose 
a  more  flexible  framework  than  some  other  rigid  frameworks  like  remote  procedure 
calls  [47],  for  example. 
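
The Announcement-Bid-Award sequence can be sketched as follows. This is a toy, single-threaded illustration and not Smith's protocol specification: the content of a bid (the bidder's current load), the award rule (lowest bid wins), and all names are assumptions.

```python
def contract_net(tasks, loads):
    """Distribute tasks with an announce-bid-award sequence.

    loads maps processor name -> current load; each bid is simply the
    bidder's load (an assumption made for this sketch)."""
    loads = dict(loads)
    awards = {}
    for task, work in tasks:
        # Announcement: the task is made known to all processors.
        # Bid: every processor replies with its current load.
        bids = {proc: load for proc, load in loads.items()}
        # Award: the announcer picks the lowest bid (ties broken by name).
        winner = min(bids, key=lambda proc: (bids[proc], proc))
        awards[task] = winner
        loads[winner] += work
    return awards, loads


awards, final_loads = contract_net(
    [("t1", 3), ("t2", 1), ("t3", 2)], {"p0": 0, "p1": 0})
print(awards, final_loads)
```

In a real distributed system the bids would arrive as messages and processors could decline to bid, which is the flexibility the framework is meant to provide.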

Malone et al. [42] have proposed an interesting specialization of the Contract
Net (called Enterprise) and showed some good connections to scheduling theory
results.  Singh  and  Genesereth  [61]  proposed  another  specialization  of  the  Contract 
Net  (called  Variable  Supply  Model)  that  was  shown  to  be  an  efficient  and  flexible 
approach  to  distributing  or-parallel  tasks  on  a  broadcast  network. 

Hornig [36] has designed a distributed reduction-style interpreter for Stardust, a
functional language of his own design. An interesting feature of this work is that
user-defined functions are annotated with time estimates provided by the user. Time
estimates can be arbitrary functions of the arguments to the function. Several
examples were presented in which time estimates can be provided reasonably. Such
time estimates could be useful for compile-time allocation as well.

Haridi  and  Ciepielewski  [33]  have  described  a  token-pool  mechanism  to  distribute 
or-parallel logic programs. The idea is that or-parallel computations are encapsulated
in tokens. These tokens may be placed in the pool as they are generated
and picked up by other processors. The difference from the Contract Net is that
computations are not handed over directly from the spawning processor to the contracting
processor. The token pool acts as an intermediary between the spawning
processor  and  the  contracting  processor.  However,  the  token  pool  seems  to  be  a 
passive  entity.  Therefore,  the  spawning  processor  does  not  have  any  control  over 
which  contracting  processor  gets  selected  for  any  spawned  computation. 

Hermenegildo  [34]  and  some  others  have  provided  an  interesting  twist  to  this 
idea  of  the  token  pool.  The  idea  is  that  computations  that  can  be  spawned  off  to 
remote processors are simply kept in local storage at some well-known location. Remote
processors can retrieve these parallel computations completely independently
without  any  intervention  of  the  local  processor.  Some  special  hardware  may  be 
needed  for  this  mechanism  but  it  has  the  advantage  that  the  busy  processors  do  not 
have  to  pay  the  overhead  of  distribution.  It  is  the  idle  processors  that  must  spend 
some  time  searching  for  some  parallel  computations  to  start  working  on. 


4.4.7  Programmed  Allocation 

Shapiro [58] has described a notation for programmers to specify their own allocations
for Concurrent Prolog programs [57]. The notation is based on the turtle
notation of LOGO programs [52] and is very elegant. However, the programmer
must have a very good idea of the structure of the program to make use of it. In
cases where the dynamic behavior of the program is not well-known by the user,
user-specified allocations are not likely to perform well.

4.5 Conclusions

This  chapter  contained  the  description  of  a  compile-time  allocation  strategy  based 
on  a  cost  function  that  is  not  specific  to  any  particular  domain  or  multiprocessor. 
It  was  shown  that  the  algorithms  involved  are  tractable  (i.e.,  they  have  polynomial 
worst-case  time  complexity).  For  the  4-bit  adder  example,  this  allocation  strategy 
produced  speedups  that  were  more  than  half  an  unreachable  upper  bound.  In  the 
FAIM-1  configuration  considered,  communication  costs  are  not  high;  therefore,  even 
a  random  allocation  does  quite  well  (though  not  as  well  as  greedy  allocation).  In 
general,  it  is  possible  that  random  allocations  may  perform  arbitrarily  worse  than 
the  allocation  strategy  presented  here. 
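
The claim about the unreachable upper bound can be checked from figures reported earlier in the chapter. The sketch assumes that the relevant bound is the “AOP” one (1063/35), which matches the average parallelism of 30.371 quoted for the adder example; the variable names are illustrative.

```python
sequential_time = 1063          # logical inference time units
upper_bound = sequential_time / 35          # unreachable bound for "AOP"

greedy_speedup = 17.642         # Greedy Allocation" with multiple copies
random_speedup = 17.015         # random allocation, same number of copies

greedy_fraction = greedy_speedup / upper_bound
random_fraction = random_speedup / upper_bound
print(f"greedy: {greedy_fraction:.3f}, random: {random_fraction:.3f}")
```

Both fractions exceed one half, and the greedy allocation sits slightly closer to the bound than the random one.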


Chapter  5 


Conclusions 


5.1  Summary  of  Key  Ideas 

In  this  thesis,  we  presented  solutions  to  two  problems:  (1)  the  design  of  a  parallel 
execution  model  for  backward-chaining  deductions  and  (2)  the  allocation  of  the 
resulting  parallel  computations  to  an  interesting  class  of  multiprocessors. 

The target class of multiprocessors has the following properties: (1) there is an
arbitrary number of MIMD processors; (2) each processor has some local memory
but  there  is  no  global  memory;  (3)  processors  can  communicate  only  by  sending 
messages  to  each  other;  (4)  message  delay  is  a  function  of  the  amount  of  data  in  the 
message  and  the  distance  between  source  and  destination;  and  (5)  each  processor 
can  perform  backward-chaining  deductions  based  on  the  subset  of  the  program  that 
it  contains. 

PM, the parallel execution model described in chapter 2, exploits more parallelism
than other execution models that use data-driven control and the same target
class of multiprocessors. In particular, PM exploits or-parallelism, and-parallelism,
and pipelining. The extra parallelism can be an important advantage in a situation
where a large number of processors are available. Data-driven control leads to
minimal synchronization overhead and means that the inherent parallelism can be
fully exploited. The chapter included a correctness theorem that stated that the
set of solutions produced by PM is identical to the set of solutions produced by a
Prolog interpreter. PM does not assume that the entire program can be stored in
each processor’s local memory. Therefore, larger programs can be run compared to
the case in which a copy of the entire program is required in each processor’s local
memory.

We  described  a  compile-time  allocation  strategy  for  PM  in  chapters  3  and  4.  In 
order  to  compare  different  allocations,  the  strategy  uses  a  cost  function  (described 
in chapter 3) that applies to any application and multiprocessor (in the target multiprocessor
class). The cost function attempts to capture intuitive notions of the
quality of allocations. The completion time of the computation, assuming zero communication
delays and infinite processors, is the completion time for the associated
parallelism profile. The non-zero delays and sequentialization of parallel computation
associated with a realistic multiprocessor will increase this completion time.
The cost function is defined to be an upper bound on this additional delay assuming
that the effects of non-zero communication delays and sequentialization of parallel
computation (due to a finite number of processors) are independent and, therefore,
additive. The upper bound on the extra delay due to non-zero communication is
given by the sum of all communication delays. This is called the communication
cost of the computation. The upper bound on the extra delay due to sequentialization
of parallel computation is called the processor multiplexing cost. The overall
cost is the sum of the communication cost and the processor multiplexing cost.

An  important  feature  of  this  cost  function  is  that  it  can  be  efficiently  computed 
and  recomputed  (for  small  changes  in  the  allocation).  Algorithms  were  presented 
for  this  computation  and  recomputation.  Unfortunately,  the  algorithms  require 
certain  restrictions  that  PM  does  not  require.  First,  the  type  of  backward-chaining 
deduction  is  restricted.  In  particular,  no  recursive  clauses  are  allowed,  unit  clauses 
must  be  ground,  and  certain  probabilistic  uniformity  and  independence  assumptions 
must  apply.  Second,  a  partitioning  of  the  database  is  assumed  to  be  given. 

Some  of  the  probabilistic  techniques  used  in  the  cost  computation  algorithms 
should  be  useful  in  other  contexts  as  well.  A  couple  of  examples  are  given  below. 
First,  the  Communication  Estimation  algorithm  computes  the  expected  amount  of 
communication between each pair of partitions. Since the trade-off between communication
and parallelism seems to be so fundamental for the allocation problem,
estimating  communication  should  be  useful  for  other  allocation  strategies.  Second, 
computing  the  parallelism  profile  is  a  side-effect  of  the  processor  multiplexing  cost 
computation.  Parallelism  profiles  have  been  used  for  allocation  strategies  other  than 
the  one  described  here  [55].  In  addition,  they  are  used  sometimes  simply  for  the 
purpose  of  estimating  the  amount  of  parallelism  inherent  in  an  application. 

In  chapter  4,  we  described  a  search  strategy  for  finding  a  satisfactory  allocation  in 
the  space  of  possible  allocations.  The  search  strategy  consisted  of  a  greedy  allocation 
phase  followed  by  a  local  minimization  phase.  Greedy  allocation  allocates  partitions 
of  the  database  to  processors  one  at  a  time.  A  partition  is  allocated  to  the  lowest 
cost  processor  without  re-allocating  any  partitions  that  were  allocated  previously. 
The  local  minimization  phase  consists  of  a  sequence  of  cost-reducing  re-allocations 
of  partitions  to  neighboring  processors  till  a  local  minimum  is  reached.  It  was  shown 
that  both  greedy  allocation  and  each  round  of  local  minimization  have  worst-case 
time  complexities  that  are  polynomial. 
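
The two phases can be sketched as below. This is a toy model and not the thesis's algorithms: processors are assumed to lie on a line, the cost function is a stand-in (traffic times distance, plus the work that must wait behind the largest partition on each processor), multiple copies are not modeled, and every name is hypothetical.

```python
def total_cost(alloc, traffic, work):
    """Communication cost plus processor multiplexing cost (toy model)."""
    comm = sum(t * abs(alloc[a] - alloc[b])
               for (a, b), t in traffic.items() if a in alloc and b in alloc)
    by_proc = {}
    for part, proc in alloc.items():
        by_proc.setdefault(proc, []).append(work[part])
    mux = sum(sum(loads) - max(loads) for loads in by_proc.values())
    return comm + mux


def greedy_allocation(partitions, n_procs, traffic, work):
    """Place partitions one at a time on the lowest-cost processor,
    never revisiting earlier placements."""
    alloc = {}
    for part in partitions:
        alloc[part] = min(
            range(n_procs),
            key=lambda p: total_cost({**alloc, part: p}, traffic, work))
    return alloc


def local_minimization(alloc, n_procs, traffic, work):
    """Move partitions to neighboring processors while the cost drops."""
    alloc = dict(alloc)
    improved = True
    while improved:
        improved = False
        for part in list(alloc):
            for step in (-1, 1):
                proc = alloc[part] + step
                if 0 <= proc < n_procs:
                    candidate = {**alloc, part: proc}
                    if (total_cost(candidate, traffic, work)
                            < total_cost(alloc, traffic, work)):
                        alloc, improved = candidate, True
    return alloc


work = {"A": 4, "B": 4, "C": 4}
traffic = {("A", "B"): 5, ("B", "C"): 1}
greedy = greedy_allocation(["A", "B", "C"], 3, traffic, work)
final = local_minimization(greedy, 3, traffic, work)
print(greedy, total_cost(greedy, traffic, work), final)
```

In this example the heavily communicating partitions A and B share a processor while C is placed next door; raising the B-C traffic to 10 puts all three partitions on one processor, which mirrors the single-processor extreme discussed in chapter 4.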

Experiments  indicate  that  greedy  allocation  alone  produces  quite  satisfactory 
answers.  For  the  4-bit  digital  adder  example  that  was  tried  on  a  simulation  of  the 
FAIM-1  multiprocessor,  the  speedup  achieved  by  using  the  greedy  allocation  was 
more  than  half  of  an  unreachable  upper  bound.  Also,  the  speedup  achieved  was 
somewhat better than that achieved by using random allocation. More analysis revealed
that random allocation works so well because this particular example is not
communication intensive at all. There are at least two cases where the difference in
performance between the allocation strategy advocated and the random allocation
strategy can be expected to be significant. First, a higher number of processors will
increase the average distance and, therefore, the average delay for the random allocation
case. However, average distances need not increase at all for the allocation
strategy advocated when more processors are used. Second, higher communication
constants associated with a different multiprocessor with less communication support
can cause the speedup of a random allocation to be arbitrarily close to zero. However,
the allocation strategy advocated here will allocate all computation to a single processor
(with a speedup of 1) when communication cost overwhelms processor multiplexing cost.


5.2  Directions  for  Future  Research 

Two versions of the greedy allocation algorithm were described in chapter 4:
Greedy Allocation' and Greedy Allocation". However, experiments were conducted
only with Greedy Allocation". It is quite possible that Greedy Allocation' will lead to
better allocations since the number of multiple copies is chosen in a less arbitrary
manner than in Greedy Allocation". The disadvantage of using Greedy Allocation' is
that it has a higher time complexity.

A constraint that was kept in mind while designing the current cost function
was to make recomputation efficient when small changes are made to the allocation.
However, as the experiments indicate, greedy allocation by itself produced quite
reasonable allocations without using local minimization at all. Since recomputation
is useful only for local minimization, there is the possibility now of using a different
cost function that is not as pessimistic as the current cost function and one that is
not necessarily designed for efficient recomputation. A more accurate cost function
of this type has the potential of improving the quality of the greedy allocation
algorithm.

At present, the allocation techniques do not apply to recursive cases. If arbitrary
recursions are allowed, it becomes undecidable to predict the amount of processing
and communication required for a parallel computation (this follows directly from
the halting problem). Therefore, good allocation decisions are unlikely. However, it
may be possible to reason automatically about restricted recursive cases. Even in
cases where completely automatic allocation is not possible, users may provide
information about parallel computation and communication to make reasonable
allocation decisions possible.

In many Artificial Intelligence problems, a single solution is required for the
problem at hand. It should be possible to extend PM to kill off redundant processes
when the first solution has been found. It may be harder to extend the allocation
techniques to reason about the modified parallel execution model. On a related
issue, PM should be modified to kill off processes associated with sibling and-nodes
when no solution is found for any one of the and-nodes.

Over  the  years,  researchers  have  developed  compilation  techniques  for  Prolog 
that  make  it  execute  at  comparable  speeds  with  other  programming  languages  for 
comparable  problems  [73,72].  More  attention  should  be  directed  towards  applying 
this  compilation  technology,  perhaps  with  extensions,  to  parallel  execution  models 
like  PM. 

Although backward-chaining deduction has been found to be very useful for a
wide range of problems, other types of deduction are more natural for certain
applications. For example, simulation is better done with forward-chaining deduction
[60] and planning problems are better handled with Residue [21]. Techniques for
exposing the parallelism in these types of deduction will be needed if the associated
applications are to be speeded up.

The allocation techniques described in this thesis were directed towards Horn
clause databases without any additional annotations. In the literature, this is called
the implicit parallelism case for logic programming in contrast to logic programming
languages that require explicit annotations to express producer-consumer
relationships between processes. Explicitly parallel logic programming languages include
Concurrent Prolog [57], PARLOG [30], and Guarded Horn Clauses (GHC) [71]. The
extent to which the allocation techniques in this thesis are applicable to these
languages remains to be seen. Going even further, the applicability of the allocation
techniques to other programming paradigms like object-oriented languages (e.g.,
Actors [1]) and Lisp-based languages (e.g., Qlisp [26] and Multilisp [31,32]) should
be investigated.

As mentioned earlier, compile-time allocation works best when good estimates
can be made at compile-time about run-time program behavior. If good compile-time
predictions can be made for some parts of the program and not for others,
it may make sense to use a hybrid strategy using both compile-time and run-time
allocation. A hybrid strategy may also include some user-specified allocations when
the user already knows how to allocate a piece of the computation exceptionally
well.
Appendix A


Partial  Order  Algorithm 


This algorithm describes how to pick a partial order for a conjunctive goal. In
particular, the partial order is represented by a directed, acyclic graph of nodes
representing the conjuncts.

On  invoking  a  rule  in  backward-chaining,  the  antecedents  of  the  rule  become  a 
new  conjunctive  subgoal  that  the  inference  engine  may  try  to  prove.  Assume  that 
appropriate  bindings,  resulting  from  the  unification  of  the  goal  with  the  consequent 
of  the  rule,  have  been  plugged  into  the  antecedents. 


A.1 Definitions

Let $C_1$ through $C_n$ be the antecedents of the rule in order from left to right. Let
$CL$ be the ordered set of the antecedents of the rule.

$$CL = \langle C_1, C_2, \ldots, C_n \rangle$$

The function $v$ is defined to take a literal as argument and return the set of
variables in the literal. For example,

$$v(p(X, Y, c)) = \{X, Y\}$$

The function $vl$ is defined to take an ordered set of literals and return the set of
variables in the literals.

$$vl(\langle C_1, C_2, \ldots, C_n \rangle) = \bigcup_{i=1}^{n} v(C_i)$$

For example,

$$vl(\langle p(X, Y, c1),\ q(Y, Z, c2) \rangle) = \{X, Y, Z\}$$

Let $d(C_i, C_j)$ be true if and only if there is a directed arc between the
corresponding nodes in the conjunct graph.

As described in chapter 2, PM allows conjuncts to be solved in parallel only
if previously solved conjuncts have already bound any shared variables that they
may have. Let us call this constraint the shared-variable constraint. Restating the
constraint, a single conjunct must first bind any given variable in $vl(CL)$, where $CL$
is the ordered set of conjuncts, before other conjuncts that share the same variable
can be solved. This distinguished conjunct is called the generator conjunct for the
variable in question. Let $g(V, C_i)$ be true if and only if $C_i$ is the generator conjunct
of the variable $V$.


A.2 Assumption

No assertions (i.e., unit clauses in a Horn clause database) contain any variables.

A.3  Algorithm 

Input: $CL$, an ordered set of conjuncts

Output: A conjunct graph (i.e., a set of directed arcs between the conjuncts) such
that (1) the partial order represented by the conjunct graph is a subset of the total
order given in the input, (2) the partial order is the minimal one satisfying condition
(1) and the shared-variable constraint, and (3) the conjunct graph is a minimal
representation of the partial order. The term “minimal” is used with reference to
the number of edges.

Condition (1) is chosen because it is expected that if the original total order
is an efficient one, then subsets of it are also efficient. Condition (2) is chosen
so that parallelism is maximized. Condition (3) is chosen so that communication
requirements for PM are minimized. Reduced communication also translates into
reduced computation at the nodes where the communication is directed.

There  are  three  parts  of  the  algorithm  and  these  are  now  described  one  by  one. 

The first part of the algorithm picks a generator conjunct for each variable. For
each variable in $vl(CL)$, pick the leftmost conjunct, $C_i$, in $CL$, such that the variable
is contained in $v(C_i)$. This conjunct is declared to be the generator of the variable
in question. The complexity of this part of the algorithm is $O(n \times k)$, where $n$ is
the number of conjuncts and $k$ is the number of variables.

This can be illustrated with an example. Consider the conjunctive goal

$$p(X) \wedge q(Y) \wedge s(X, Y)$$

In this case,

$$
\begin{aligned}
CL &= \langle C_1, C_2, C_3 \rangle \\
C_1 &= p(X) \\
C_2 &= q(Y) \\
C_3 &= s(X, Y) \\
vl(CL) &= \{X, Y\} \\
v(C_1) &= \{X\} \\
v(C_2) &= \{Y\} \\
v(C_3) &= \{X, Y\}
\end{aligned}
$$

The generator conjuncts are described by $g(X, C_1)$ and $g(Y, C_2)$.
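
The first part of the algorithm can be sketched as follows. This is a minimal illustration, not the implementation used in PM: literals are represented as hypothetical `(predicate, args)` tuples, argument names starting with an uppercase letter are assumed to be variables (the Prolog convention), and conjuncts are identified by 0-based position, so $C_1$ is index 0.

```python
def variables(literal):
    """The function v: the set of variables occurring in a literal."""
    _, args = literal
    return {arg for arg in args if arg[:1].isupper()}


def pick_generators(conjuncts):
    """Map each variable to the index of the leftmost conjunct containing it."""
    generators = {}
    for index, literal in enumerate(conjuncts):
        for var in variables(literal):
            # setdefault keeps the first (leftmost) conjunct seen for var
            generators.setdefault(var, index)
    return generators


goal = [("p", ("X",)), ("q", ("Y",)), ("s", ("X", "Y"))]
generators = pick_generators(goal)
print(generators)  # {'X': 0, 'Y': 1}
```

The result corresponds to $g(X, C_1)$ and $g(Y, C_2)$ above. Each conjunct is scanned once and each variable is looked up in a dictionary, which stays within the $O(n \times k)$ bound.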

In the second part of the algorithm, directed arcs are introduced between the
generator conjuncts and other conjuncts. For each generator conjunct and each other
conjunct that contains the variable generated by the generator, insert a directed arc
from the generator's node to the other conjunct's node in the partial order graph.
The complexity of this is $O(n^2)$, where $n$ is the number of conjuncts in $CL$. Again,
this is best illustrated by an example. Consider the same example that was just
considered above for picking the generator conjuncts. Since the generator conjunct
for variable $X$ is $C_1$ (i.e., $p(X)$) and it is the case that $C_3$ (i.e., $s(X, Y)$) contains
the same variable, a directed arc, $d(C_1, C_3)$, is introduced. Similar reasoning leads to
the only other directed arc $d(C_2, C_3)$. At this point, the partial order described by
the set of directed arcs satisfies the shared-variable constraint. However, this may
not be a minimal partial order satisfying the constraint as shown in a different
example below.

It is possible that the partial order generated by the algorithm so far is as given
below:

$$\{d(C_1, C_2),\ d(C_2, C_3),\ d(C_1, C_3)\}$$

This would happen if the variables contained in $C_1$, $C_2$, and $C_3$ are $\{X\}$, $\{X, Y\}$,
and $\{X, Y, Z\}$ respectively. The arc $d(C_1, C_3)$ represents a redundant arc and can
be removed while still maintaining the shared-variable constraint. Such arcs are
called transitive arcs. An arc is a transitive arc if and only if there is a longer path
between the end nodes of the arc.

A paper by Aho, Garey, and Ullman [2] shows how to remove all these transitive
arcs from a directed, acyclic graph in time $O(n^3)$, where $n$ is the number of vertices.
The output of the algorithm is called the transitive reduction of the input graph.
This transitive reduction algorithm is the third part of the partial order algorithm.
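
For illustration, transitive arcs can be removed from a small directed, acyclic graph as sketched below. This is a simple depth-first version written for clarity, not the algorithm of Aho, Garey, and Ullman, and its running time is worse than the bound quoted above; the function name and the arc representation (pairs of node labels) are assumptions made for the example.

```python
def transitive_reduction(arcs):
    """Drop every arc (u, v) of a DAG for which a longer u-to-v path exists."""
    successors = {}
    for u, v in arcs:
        successors.setdefault(u, set()).add(v)

    def reachable(start, target, skip):
        # Depth-first search from start that never follows the arc `skip`.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in successors.get(node, ()):
                if (node, nxt) == skip or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(u, v) for u, v in arcs if not reachable(u, v, (u, v))}


# The three-conjunct example above: d(C1,C2), d(C2,C3), d(C1,C3).
reduced = transitive_reduction({(1, 2), (2, 3), (1, 3)})
print(sorted(reduced))  # [(1, 2), (2, 3)]
```

Because the input is acyclic, the arcs that survive this test are exactly those of the transitive reduction.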

The overall complexity of the partial order algorithm is obtained by adding the
complexities of the three component procedures. The complexity is
$O(n \times k + n^2 + n^3)$ or $O(n^3)$, assuming that $k$ is $O(n^2)$.


A.4 Another Example

Consider the rule

color(A, B, C, D, E) :-
    next(A, B) ∧ next(C, D) ∧ next(A, C) ∧ next(A, D) ∧
    next(B, C) ∧ next(B, E) ∧ next(C, E) ∧ next(D, E)

This rule is part of the database used for a particular instance of the four color
problem. A goal

color(A, B, C, D, E)

would generate the conjunctive subgoal

next(A, B) ∧ next(C, D) ∧ next(A, C) ∧ next(A, D) ∧
next(B, C) ∧ next(B, E) ∧ next(C, E) ∧ next(D, E)

Now, next(A, B) is the generator for both A and B. Also, next(C, D) is the generator
for both C and D. Finally, next(B, E) is the generator for E.

The partial order contains the following directed arcs: from next(A, B) to each
member of {next(A, C), next(A, D), next(B, C), next(B, E)}, from next(C, D) to
each member of {next(A, C), next(A, D), next(B, C), next(C, E), next(D, E)}, and
from next(B, E) to each member of {next(C, E), next(D, E)}.

There  are  no  transitive  arcs  to  remove  in  this  case. 
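
The three parts can be run together on this rule as a check. The sketch below is illustrative only: literals are hypothetical `(predicate, args)` tuples, uppercase argument names are assumed to be variables, conjuncts are numbered from 0 in left-to-right order, and the transitive-arc test is a plain search rather than the algorithm of [2].

```python
def variables(literal):
    """v: variables of a (predicate, args) literal; uppercase names are variables."""
    return {arg for arg in literal[1] if arg[:1].isupper()}


def conjunct_graph(conjuncts):
    """Return (arcs, transitive_arcs) over 0-based conjunct indices."""
    # Part 1: the leftmost conjunct containing a variable is its generator.
    generator = {}
    for i, literal in enumerate(conjuncts):
        for var in variables(literal):
            generator.setdefault(var, i)

    # Part 2: arc from each generator to every other conjunct using its variable.
    arcs = {(g, j)
            for var, g in generator.items()
            for j, literal in enumerate(conjuncts)
            if j != g and var in variables(literal)}

    # Part 3: an arc is transitive if a longer path joins its end nodes.
    def reaches(src, dst):
        stack, seen = [src], {src}
        while stack:
            node = stack.pop()
            for a, b in arcs:
                if a == node and b not in seen:
                    if b == dst:
                        return True
                    seen.add(b)
                    stack.append(b)
        return False

    transitive = {(u, v) for u, v in arcs
                  if any(w != v and reaches(w, v) for a, w in arcs if a == u)}
    return arcs, transitive


pairs = ["AB", "CD", "AC", "AD", "BC", "BE", "CE", "DE"]
goal = [("next", tuple(pair)) for pair in pairs]
arcs, transitive = conjunct_graph(goal)
print(len(arcs), sorted(transitive))  # 11 []
```

The eleven arcs are the ones listed above (four from next(A, B), five from next(C, D), two from next(B, E)), and the empty second result agrees with the observation that no transitive arcs need to be removed.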
Appendix B

Details  of  Procedure  PMCCA-1 


The procedure PMCCA-1 is used to compute processor multiplexing cost for a
single processor. Chapter 3 described this procedure but omitted details of the
two procedures InsertPI and DeletePI. These procedures are given in this appendix
in more detail (in sections B.2 and B.3). The procedures use the abstract data
structure called PQ-list and its description is repeated in section B.1 for the reader’s
convenience.

Pseudo-Pascal  code  is  given  for  the  procedures,  with  comments  being  delimited 
by  “/*”  on  the  left  and  “*/”  on  the  right. 

B.1 PQ-list Data Structure

This abstract data structure has three associated abstract operations as described
in chapter 3.

1. InsertPQL(PQL, element, key): This inserts the element element into the
PQ-list PQL in log time. In addition, the CLoad field of the element is set
to the CLoad field of the previous element (in sorted order). If there is no
previous element, then the field is set to zero.

2. DeletePQL(PQL, element): This deletes element from PQL in log time.

3. EnumeratePQL(PQL, element1, element2): Enumerates all elements in PQL
in sorted order from the element element1 to the element element2. This is
done in time linear in the number of elements enumerated.

The  list  is  doubly-linked  to  allow  forward  or  backward  traversal.  The  utility 
of  backward  pointers  will  become  apparent  later  in  the  description  of  procedures 
InsertPI  and  DeletePI. 


B.2  Procedure  InsertPI 

The  InsertPI  procedure  is  given  in  figure  36. 

The  insert  procedure  uses  InsertPQL  to  insert  the  two  end-points  of  the  process¬ 
ing  intervals  as  two  elements  into  the  PQ-list.  It  also  enumerates  all  the  elements 
from  the  start  element  to  the  finish  element  and  modifies  their  CLoad  appropriately 
to  reflect  the  change.  In  addition,  PMC  is  changed  as  each  element  is  considered. 
The  correct  PMC  is  available  at  the  end  of  the  procedure. 


B.3  Procedure  DeletePI 

The  DeletePI  procedure  is  given  in  figure  37. 

The  delete  procedure  enumerates  all  the  elements  from  the  start  element  to  the 
finish  element.  It  modifies  the  CLoad  values  of  the  elements  appropriately.  Also, 
PMC  associated  with  the  PQ-list  is  changed  as  each  element  is  considered.  The 
correct  PMC  is  available  at  the  end  of  the  procedure.  Moreover,  the  start  and 
finish  elements  are  deleted  from  the  data-structure  (using  DeletePQL)  when  they 
are  enumerated. 
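
The behavior of the two procedures can be tried out with the following Python sketch. It is an illustration under simplifying assumptions rather than a transcription of figures 36 and 37: the PQ-list is a plain sorted Python list (so insertion and deletion take linear rather than log time), the two end elements are deleted after the enumeration loop instead of inside it, and all class and function names are invented for the example.

```python
import bisect


class Elem:
    """One end-point of a processing interval."""

    def __init__(self, time):
        self.time = time
        self.cload = 0  # load on the stretch that starts at this end-point


class PQList:
    def __init__(self):
        self.elems = []  # end-points kept sorted by time
        self.pmc = 0     # processor multiplexing cost

    def insert(self, time):
        """InsertPQL: the new element inherits CLoad from its predecessor."""
        pos = bisect.bisect_right([e.time for e in self.elems], time)
        elem = Elem(time)
        elem.cload = self.elems[pos - 1].cload if pos > 0 else 0
        self.elems.insert(pos, elem)
        return elem


def _sweep(pql, first, last, delta):
    """Loop shared by InsertPI and DeletePI: update CLoad and PMC."""
    i, j = pql.elems.index(first), pql.elems.index(last)
    prev = pql.elems[i - 1] if i > 0 else None
    cload_prev = cload_prev_old = prev.cload if prev else 0
    time_prev = prev.time if prev else 0
    for elem in pql.elems[i:j + 1]:
        excess_change = max(0, cload_prev - 1) - max(0, cload_prev_old - 1)
        pql.pmc += excess_change * (elem.time - time_prev)
        cload_prev_old = elem.cload
        if elem is not last:
            elem.cload += delta
        cload_prev = elem.cload
        time_prev = elem.time


def insert_pi(pql, start, finish, load):
    first, last = pql.insert(start), pql.insert(finish)
    _sweep(pql, first, last, load)
    return first, last


def delete_pi(pql, first, last, load):
    _sweep(pql, first, last, -load)
    pql.elems.remove(first)
    pql.elems.remove(last)


pql = PQList()
a = insert_pi(pql, 0, 10, 1)
b = insert_pi(pql, 5, 15, 1)
print(pql.pmc)  # 5: two unit loads overlap on the stretch from 5 to 10
delete_pi(pql, b[0], b[1], 1)
print(pql.pmc)  # 0
```

In this sketch PMC accumulates, over time, the amount by which the cumulative load exceeds 1, which is the quantity the bracketed expression in the figures updates incrementally.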


Procedure InsertPI(PI, PQList);
begin
    HI ← PI.ProcessorLoad;
    /* PMC = Current value of processor multiplexing cost */
    StartElem ← InsertPQL(PQList, start(PI), start(PI).Time);
    FinishElem ← InsertPQL(PQList, finish(PI), finish(PI).Time);
    If StartElem.Prev ≠ nil then begin
        /* The prev field is the previous element in sorted order. */
        CLoadPrev ← StartElem.Prev.CLoad;
        CLoadPrevOld ← StartElem.Prev.CLoad;
        TimePrev ← StartElem.Prev.Time
    end
    else begin
        CLoadPrev ← 0;
        CLoadPrevOld ← 0;
        TimePrev ← 0
    end;
    For all Elem ∈ EnumeratePQL(PQList, StartElem, FinishElem) do begin
        PMC ← PMC + [max(0, CLoadPrev − 1) − max(0, CLoadPrevOld − 1)] ×
              (Elem.Time − TimePrev);
        CLoadPrevOld ← Elem.CLoad;
        If Elem ≠ FinishElem then
            Elem.CLoad ← Elem.CLoad + HI;
        CLoadPrev ← Elem.CLoad;
        TimePrev ← Elem.Time
    end /* for */
end; /* InsertPI */


Figure 36: Procedure InsertPI

Procedure DeletePI(Elem1, Elem2, PQList);
begin
    HI ← PI.ProcessorLoad;
    /* PMC = Current value of processor multiplexing cost */
    If Elem1.Prev ≠ nil then begin
        CLoadPrev ← Elem1.Prev.CLoad;
        CLoadPrevOld ← Elem1.Prev.CLoad;
        TimePrev ← Elem1.Prev.Time
    end
    else begin
        CLoadPrev ← 0;
        CLoadPrevOld ← 0;
        TimePrev ← 0
    end;
    For all Elem ∈ EnumeratePQL(PQList, Elem1, Elem2) do begin
        PMC ← PMC + [max(0, CLoadPrev − 1) − max(0, CLoadPrevOld − 1)] ×
              (Elem.Time − TimePrev);
        CLoadPrevOld ← Elem.CLoad;
        If Elem ≠ Elem2 then
            Elem.CLoad ← Elem.CLoad − HI;
        CLoadPrev ← Elem.CLoad;
        If Elem = Elem1 or Elem = Elem2 then
            DeletePQL(PQList, Elem);
        TimePrev ← Elem.Time
    end /* for */
end; /* DeletePI */


Figure  37:  Procedure  DeletePI 
Appendix C


Greedy Allocation is not Locally
Optimal


This appendix presents an example where greedy allocation does not produce a
locally optimal solution. Consider the dataflow graph in figure 38 and the processor
topology shown in figure 39. Assume that there is no processing overlap between
nodes A or B with C. Also, let there be no overlap between nodes B or C with D. Let
the amounts of communication between the node pairs A and B and separately A
and C be very low and equal to each other. Also, let the amounts of communication
between the node pairs B and D and separately C and D be very high and equal to
each other. If the greedy allocation algorithm allocates the nodes in the topological
order A, B, C, and then D, a possible allocation may be as given below:

A → 1
B → 1
C → 2
D → 1

When B and C get allocated by the greedy allocation procedure, the only
communication considered is from A to B and C. However, this is not necessarily the
locally optimal solution. If communication from B and C to D is high enough relative
to the communication from A to B and C, then moving C from processor 2 to 1
may reduce the cost function. Although the processor multiplexing cost is
increased, the effect due to reduction of communication cost may be greater.

Figure 38: A Dataflow Graph

Figure 39: A Processor Topology




Appendix  D 


Local  Minimization  is  not 
Globally  Optimal 


This appendix presents an example where local minimization of an allocation does
not produce a globally optimal solution. Consider the dataflow graph in figure 40
and the processor topology shown in figure 41. Assume that there is no processing
overlap between node A and any of the nodes in the set {B, C, D, E}. Also, assume
that there is no overlap between any of the nodes in the set {B, C, D, E} and node
F. Let the amounts of communication from A to any node in the set {B, C, D,
E} be equal and very low. Let the amounts of communication from any node in {B,
C, D, E} to node F be equal and very large. Assume, in addition, that memory
requirements dictate that at most one node may be allocated to a single processor.
If the greedy allocation algorithm allocates nodes in the topological order A, B, C,
D, E, and then F, a possible allocation may be as given below:

A → 1
B → 2
C → 4
D → 6

Figure 40: A Dataflow Graph

Figure 41: A Processor Topology


The allocation produced by the greedy allocation procedure is already a locally
optimal solution because any feasible local neighbor has a higher cost. However,
it is easy to see that this is not necessarily the globally optimal solution. Given that
communication from nodes in the set {B, C, D, E} to F is high enough compared
to the communication from A to the nodes in the set {B, C, D, E}, a lower
cost solution is given below:

F → 1

B

C → 4


It is interesting to note that if the order chosen for greedy allocation had been
reversed, this lower cost allocation would have been the one generated.

Examples can also be generated that are locally optimal but not globally optimal
and where the size of memory is not an issue.
Appendix E
Adder  Example 


E.1 Syntax and Notation

Anything following a semicolon (;) on a line is a comment. The syntax for facts and
rules in MRS is different from the standard Prolog syntax. Variables are symbols that
begin with the character $. A literal in Prolog as in

<predicate>(<field1>, <field2>, ..., <fieldn>)

is written in MRS as

(<predicate> <field1> <field2> ... <fieldn>).

A rule in Prolog as in

<goal> :- <subgoal1>, <subgoal2>, ..., <subgoaln>

is written in MRS as

(if (and <subgoal1> <subgoal2> ... <subgoaln>) <goal>).

In addition, a fact at compile-time is represented in a different way than at run-time.
Compile-time facts are written as

(fact <run-time-fact> <list-of-variables> <number>).

<list-of-variables> indicates that all the variables in the list are actually unknown
constants in the run-time fact <run-time-fact>. <number> is the number
of facts matching this fact pattern that are expected to be present at run-time.

The device whose structure and behavior is captured here is called “FOO”. It
is a 4-bit adder. The database contains many literals of the form

(VALUE (PORT <port-name> <device>) <value>).

This is intended to mean that the value of the specified port <port-name> of device
<device> is <value>. <device> is either the top-level device “FOO” or parts of
it specified in a hierarchical fashion. (PART (NUM FA i) FOO) for i from 1 to 4
stands for the ith 1-bit full adder. Each 1-bit full adder is composed of 5 gates: 1
or gate, 2 exclusive-or gates, and 2 and gates. An example of a device at this lowest
level is (PART OR1 (PART (NUM FA 4.) FOO)). This represents the 1st or gate
(OR1) of the 4th 1-bit full adder (FA) of the top-level device (FOO).
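
As a cross-check on the device being described, the five-gate full adder and the 4-bit ripple-carry chain can be simulated directly. The sketch below is not part of the MRS database: it is a plain Python illustration whose gate wiring is read off the rules listed in section E.2 (XOR1 and AND1 take the external inputs, AND2 and XOR2 combine the carry-in with the output of XOR1, and OR1 produces the carry-out), and the function names are invented for the example.

```python
def full_adder(in1, in2, cin):
    """One 1-bit full adder built from the five gates named in the text."""
    xor1 = in1 ^ in2       # XOR1: IN1, IN2
    and1 = in1 & in2       # AND1: IN1, IN2
    and2 = cin & xor1      # AND2: CIN, output of XOR1
    xor2 = xor1 ^ cin      # XOR2: output of XOR1, CIN -> SUM
    or1 = and2 | and1      # OR1: outputs of AND2 and AND1 -> COUT
    return xor2, or1


def ripple_adder(in1_bits, in2_bits, cin=0):
    """Chain full adders; COUT of adder i feeds CIN of adder i + 1."""
    sums, carry = [], cin
    for a, b in zip(in1_bits, in2_bits):
        bit, carry = full_adder(a, b, carry)
        sums.append(bit)
    return sums, carry


# External inputs as listed in section E.2.1, full adder 1 first:
# IN1 = 1, 1, 1, 1; IN2 = 1, 0, 0, 0; CIN of the first adder = 0.
sums, cout = ripple_adder([1, 1, 1, 1], [1, 0, 0, 0], cin=0)
print(sums, cout)  # [0, 0, 0, 0] 1
```

Read with full adder 1 as the least significant bit, the inputs are 15 and 1, and the result (all sum bits 0, final carry 1) is 16.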


E.2  Adder  Database 

E.2.1 Adder Database at Run-Time

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

;;; External inputs at run-time

(VALUE (PORT IN1 (PART (NUM FA 1.) FOO)) 1.)

(VALUE (PORT IN1 (PART (NUM FA 2.) FOO)) 1.)

(VALUE (PORT IN1 (PART (NUM FA 3.) FOO)) 1.)

(VALUE (PORT IN1 (PART (NUM FA 4.) FOO)) 1.)

(VALUE (PORT IN2 (PART (NUM FA 1.) FOO)) 1.)

(VALUE (PORT IN2 (PART (NUM FA 2.) FOO)) 0.)

(VALUE (PORT IN2 (PART (NUM FA 3.) FOO)) 0.)

(VALUE (PORT IN2 (PART (NUM FA 4.) FOO)) 0.)


(VALUE (PORT CIN (PART (NUM FA 1.) FOO)) 0.)

;;; End of external inputs


(IF (VALUE (PORT OUT (PART OR1 (PART (NUM FA 1.) FOO))) $368.)
    (VALUE (PORT COUT (PART (NUM FA 1.) FOO)) $368.))

(IF (VALUE (PORT OUT (PART OR1 (PART (NUM FA 2.) FOO))) $366.)
    (VALUE (PORT COUT (PART (NUM FA 2.) FOO)) $366.))

(IF (VALUE (PORT OUT (PART OR1 (PART (NUM FA 3.) FOO))) $364.)
    (VALUE (PORT COUT (PART (NUM FA 3.) FOO)) $364.))

(IF (VALUE (PORT OUT (PART OR1 (PART (NUM FA 4.) FOO))) $362.)
    (VALUE (PORT COUT (PART (NUM FA 4.) FOO)) $362.))

(IF (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 1.) FOO))) $360.)
    (VALUE (PORT SUM (PART (NUM FA 1.) FOO)) $360.))

(IF (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 2.) FOO))) $348.)
    (VALUE (PORT SUM (PART (NUM FA 2.) FOO)) $348.))

(IF (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 3.) FOO))) $346.)
    (VALUE (PORT SUM (PART (NUM FA 3.) FOO)) $346.))

(IF (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 4.) FOO))) $344.)
    (VALUE (PORT SUM (PART (NUM FA 4.) FOO)) $344.))

(IF (VALUE (PORT CIN (PART (NUM FA 1.) FOO)) $342.)
    (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 1.) FOO))) $342.))

(IF (VALUE (PORT CIN (PART (NUM FA 2.) FOO)) $340.)
    (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 2.) FOO))) $340.))

(IF (VALUE (PORT CIN (PART (NUM FA 3.) FOO)) $338.)
    (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 3.) FOO))) $338.))

(IF (VALUE (PORT CIN (PART (NUM FA 4.) FOO)) $336.)
    (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 4.) FOO))) $336.))

(IF (VALUE (PORT CIN (PART (NUM FA 1.) FOO)) $334.)
    (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 1.) FOO))) $334.))

(IF (VALUE (PORT CIN (PART (NUM FA 2.) FOO)) $332.)
    (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 2.) FOO))) $332.))

(IF (VALUE (PORT CIN (PART (NUM FA 3.) FOO)) $330.)
    (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 3.) FOO))) $330.))

(IF (VALUE (PORT CIN (PART (NUM FA 4.) FOO)) $328.)
    (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 4.) FOO))) $328.))

(IF (VALUE (PORT IN2 (PART (NUM FA 1.) FOO)) $326.)
    (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 1.) FOO))) $326.))

(IF (VALUE (PORT IN2 (PART (NUM FA 2.) FOO)) $324.)
    (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 2.) FOO))) $324.))

(IF (VALUE (PORT IN2 (PART (NUM FA 3.) FOO)) $322.)
    (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 3.) FOO))) $322.))

(IF (VALUE (PORT IN2 (PART (NUM FA 4.) FOO)) $320.)
    (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 4.) FOO))) $320.))

(IF (VALUE (PORT IN2 (PART (NUM FA 1.) FOO)) $318.)
    (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 1.) FOO))) $318.))

(IF (VALUE (PORT IN2 (PART (NUM FA 2.) FOO)) $316.)
    (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 2.) FOO))) $316.))

(IF (VALUE (PORT IN2 (PART (NUM FA 3.) FOO)) $314.)
    (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 3.) FOO))) $314.))

(IF (VALUE (PORT IN2 (PART (NUM FA 4.) FOO)) $312.)
    (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 4.) FOO))) $312.))


(IF  (VALUE  (POET  III  (PAET  (IUH  FA  1.)  FOO))  $310.) 
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(VALUE 

(PORT 

III 

(PART 

AVD1 

(PART 

(IUH  FA  1.)  FOO))) 

$310.)) 

(IF 

(VALUE 

(PORT 

III 

(PART 

(IUH 

FA  3.) 

FOO))  1308.) 

(VALUE 

(PORT 

III 

(PART 

AVD1 

(PART 

(IUH  FA  3.)  FOO))) 

$308.)) 

(IF 

(VALUE 

(PORT 

III 

(PART 

(VUH 

FA  3.) 

FOO))  $308.) 

(VALUE 

(PORT 

III 

(PART 

AVD1 

(PART 

(IUH  FA  3.)  FOO))) 

$306.)) 

(IF 

(VALUE 

(PORT 

III 

(PART 

(VUH 

FA  4.) 

FOO))  $304.) 

(VALUE 

(PORT 

III 

(PART 

AVD1 

(PART 

(IUH  FA  4.)  FOO))) 

$304.)) 

(IF 

(VALUE 

(PORT 

III 

(PART 

(VUH 

FA  1.) 

FOO))  $303.) 

(VALUE 

(PORT 

Ill 

(PART 

XOR1 

(PART 

(IUH  FA  1.)  FOO))) 

$303.)) 

(IF 

(VALUE 

(PORT 

III 

(PART 

(VUH 

FA  3.) 

FOO))  $300.) 

(VALUE 

(PORT 

III 

(PART 

ZOR1 

(PART 

(IUH  FA  3.)  FOO))) 

$300.)) 

(IF 

(VALUE 

(PORT 

III 

(PART 

(VUH 

FA  3.) 

FOO))  $398.) 

(VALUE 

(PORT 

III 

(PART 

X0R1 

(PART 

(IUH  FA  3.)  FOO))) 

$398.)) 

(IF 

(VALUE 

(PORT 

III 

(PART 

(VUH 

FA  4.) 

FOO))  $396.) 

(VALUE 

(PORT 

III 

(PART 

XORl 

(PART 

(IUH  FA  4.)  FOO))) 

$396.)) 

(IF 

(VALUE 

(PORT 

COUT  (PART  (IUH  FA  1. 

)  FOO))  $37$. ) 

(VALUE 

(PORT 

CIV 

(PART 

(VUH  FA  2.) 

FOO))  $375.)) 

(IF 

(VALUE 

(PORT 

COUT  (PART  (IUH  FA  2. 

)  FOO))  $373.) 

(VALUE 

(PORT 

CIV 

(PART 

(IUH  FA  3.) 

FOO))  $373.)) 

(IF 

(VALUE 

(PORT 

COUT  (PART  (IUH  FA  3. 

)  FOO))  $371.) 

    (VALUE (PORT CIN (PART (NUM FA 4.) FOO)) $371.))

(IF (VALUE (PORT OUT (PART AND2 (PART (NUM FA 1.) FOO))) $369.)
    (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 1.) FOO))) $369.))

(IF (VALUE (PORT OUT (PART AND1 (PART (NUM FA 1.) FOO))) $367.)
    (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 1.) FOO))) $367.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1.) FOO))) $265.)
    (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 1.) FOO))) $265.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1.) FOO))) $263.)
    (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 1.) FOO))) $263.))

(IF (AND (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 1.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 1.) FOO))) $261.))
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 1.) FOO))) $261.))

(IF (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 1.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 1.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 1.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 1.) FOO))) $258.))
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 1.) FOO))) $258.))

(IF (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 1.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 1.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 1.) FOO))) 0.)
         (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 1.) FOO))) $255.))
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 1.) FOO))) $255.))

(IF (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 1.) FOO))) 1.)
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 1.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 1.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 1.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 1.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 1.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 1.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 1.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 1.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 1.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 1.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 1.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 1.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 1.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 1.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 1.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 1.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 1.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 1.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 1.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 1.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 1.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1.) FOO))) 0.))

(IF (VALUE (PORT OUT (PART AND2 (PART (NUM FA 2.) FOO))) $244.)
    (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 2.) FOO))) $244.))

(IF (VALUE (PORT OUT (PART AND1 (PART (NUM FA 2.) FOO))) $242.)
    (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 2.) FOO))) $242.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2.) FOO))) $240.)
    (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 2.) FOO))) $240.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2.) FOO))) $238.)
    (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 2.) FOO))) $238.))

(IF (AND (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 2.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 2.) FOO))) $238.))
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 2.) FOO))) $238.))

(IF (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 2.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 2.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 2.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 2.) FOO))) $233.))
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 2.) FOO))) $233.))

(IF (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 2.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 2.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 2.) FOO))) 0.)
         (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 2.) FOO))) $230.))
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 2.) FOO))) $230.))

(IF (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 2.) FOO))) 1.)
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 2.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 2.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 2.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 2.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 2.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 2.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 2.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 2.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 2.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 2.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 2.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 2.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 2.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 2.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 2.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 2.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 2.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 2.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 2.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 2.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 2.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2.) FOO))) 0.))
(IF (VALUE (PORT OUT (PART AND2 (PART (NUM FA 3.) FOO))) $219.)
    (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 3.) FOO))) $219.))

(IF (VALUE (PORT OUT (PART AND1 (PART (NUM FA 3.) FOO))) $217.)
    (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 3.) FOO))) $217.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3.) FOO))) $215.)
    (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 3.) FOO))) $215.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3.) FOO))) $213.)
    (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 3.) FOO))) $213.))

(IF (AND (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 3.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 3.) FOO))) $211.))
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 3.) FOO))) $211.))

(IF (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 3.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 3.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 3.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 3.) FOO))) $208.))
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 3.) FOO))) $208.))

(IF (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 3.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 3.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 3.) FOO))) 0.)
         (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 3.) FOO))) $205.))
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 3.) FOO))) $205.))

(IF (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 3.) FOO))) 1.)
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 3.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 3.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 3.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 3.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 3.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 3.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 3.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 3.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 3.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 3.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 3.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 3.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 3.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 3.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 3.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 3.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 3.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 3.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 3.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 3.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 3.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3.) FOO))) 0.))

(IF (VALUE (PORT OUT (PART AND2 (PART (NUM FA 4.) FOO))) $194.)
    (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 4.) FOO))) $194.))

(IF (VALUE (PORT OUT (PART AND1 (PART (NUM FA 4.) FOO))) $192.)
    (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 4.) FOO))) $192.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4.) FOO))) $190.)
    (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 4.) FOO))) $190.))

(IF (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4.) FOO))) $188.)
    (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 4.) FOO))) $188.))

(IF (AND (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 4.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 4.) FOO))) $186.))
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 4.) FOO))) $186.))

(IF (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 4.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND2 (PART (NUM FA 4.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 4.) FOO))) 1.)
         (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 4.) FOO))) $183.))
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 4.) FOO))) $183.))

(IF (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 4.) FOO))) 0.)
    (VALUE (PORT OUT (PART AND1 (PART (NUM FA 4.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 4.) FOO))) 0.)
         (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 4.) FOO))) $180.))
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 4.) FOO))) $180.))

(IF (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 4.) FOO))) 1.)
    (VALUE (PORT OUT (PART OR1 (PART (NUM FA 4.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 4.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 4.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 4.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 4.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 4.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 4.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 4.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 4.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 4.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 4.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 4.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 4.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 4.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 4.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4.) FOO))) 0.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 4.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 4.) FOO))) 1.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 4.) FOO))) 1.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 4.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4.) FOO))) 1.))

(IF (AND (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 4.) FOO))) 0.)
         (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 4.) FOO))) 0.))
    (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4.) FOO))) 0.))
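
The gate rules listed above amount to truth tables for AND, OR and XOR plus the wiring between gates. A minimal Python sketch of the same logic follows. The function names are hypothetical, and two pieces of wiring are assumptions rather than something read from the rules shown here: that CIN feeds IN1 of AND2 and IN2 of XOR2, and that the COUT of each full-adder feeds the CIN of the next.

```python
def and_gate(in1, in2):
    # Rule pair: IN1 = 1 passes IN2 through; IN1 = 0 forces OUT = 0.
    return in2 if in1 == 1 else 0


def or_gate(in1, in2):
    # Rule pair: IN1 = 0 passes IN2 through; IN1 = 1 forces OUT = 1.
    return in2 if in1 == 0 else 1


def xor_gate(in1, in2):
    # Four rules, one per input combination.
    table = {(1, 1): 0, (0, 1): 1, (1, 0): 1, (0, 0): 0}
    return table[(in1, in2)]


def full_adder(in1, in2, cin):
    """Return (sum, cout) for one full-adder built from XOR1, XOR2, AND1, AND2, OR1."""
    x1 = xor_gate(in1, in2)   # XOR1
    a1 = and_gate(in1, in2)   # AND1
    a2 = and_gate(cin, x1)    # AND2: IN2 fed by XOR1 OUT (rule); IN1 = CIN is assumed
    s = xor_gate(x1, cin)     # XOR2: IN1 fed by XOR1 OUT (rule); IN2 = CIN is assumed
    cout = or_gate(a2, a1)    # OR1: IN1 from AND2 OUT, IN2 from AND1 OUT (rules)
    return s, cout


def ripple_add(a_bits, b_bits, cin=0):
    """Add two bit lists (least significant bit first); return (sum_bits, carry_out)."""
    sums = []
    carry = cin
    for a, b in zip(a_bits, b_bits):
        s, carry = full_adder(a, b, carry)  # assumed: COUT of stage n drives CIN of stage n+1
        sums.append(s)
    return sums, carry
```

With four stages, the final carry returned by `ripple_add` plays the role of the COUT port of the fourth full-adder.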


E.2.2  Adder  Database  at  Compile-Time 

The  rules  are  the  same  as  the  rules  in  the  run-time  database  and  will  not  be  repeated 
here.  The  facts  are  written  in  a  different  fashion  and  these  are  given  below. 


;;; External inputs at compile-time

(fact (VALUE (PORT IN1 (PART (NUM FA 1.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT IN1 (PART (NUM FA 2.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT IN1 (PART (NUM FA 3.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT IN1 (PART (NUM FA 4.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT IN2 (PART (NUM FA 1.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT IN2 (PART (NUM FA 2.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT IN2 (PART (NUM FA 3.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT IN2 (PART (NUM FA 4.) FOO)) $x) ($x) 1)
(fact (VALUE (PORT CIN (PART (NUM FA 1.) FOO)) $x) ($x) 1)
;;; End of external inputs
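
The external-input facts follow one template: a VALUE pattern for a port of a numbered full-adder, the list of unknowns, and a count of 1. As an illustration only, the listing could be generated as sketched below. The helper names are hypothetical, and the port spellings (IN1, IN2, CIN) and the variable `$x` are assumed readings of the listing.

```python
def compile_time_fact(port, n):
    """Format one compile-time fact for `port` of full-adder number `n`."""
    pattern = f"(VALUE (PORT {port} (PART (NUM FA {n}.) FOO)) $x)"
    # Second term: list of unknowns; third term: count (1 for each input).
    return f"(fact {pattern} ($x) 1)"


def external_input_facts(width=4):
    """Facts for IN1 and IN2 of every full-adder, plus CIN of the first one."""
    facts = [
        compile_time_fact(port, n)
        for port in ("IN1", "IN2")
        for n in range(1, width + 1)
    ]
    facts.append(compile_time_fact("CIN", 1))
    return facts
```

For a four-bit adder this yields nine facts, one per external input.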


E.3  Goal 

E.3.1  Goal  at  Run-Time 

The goal is to determine the value of the “COUT” port of the fourth full-adder
in the top-level device “FOO”. The fourth full-adder is associated with the
highest-order bit. The goal is given below.

(VALUE  (PORT  COUT  (PART  (NUM  FA  4)  FOO))  $X) 

E.3.2  Goal  at  Compile-Time 

The syntax for goals at compile-time is similar to that for facts at compile-time,
except that the predicate is “GOAL” rather than “FACT”. The first term is the goal
pattern, and the next two terms are the list of unknown constants and the number
of goals, respectively. The goal is shown below.

(GOAL (VALUE (PORT COUT (PART (NUM FA 4.) FOO)) $X) NIL 1.)

E.4  Domain  Information


The cardinality of the domain of every variable is 2, because each variable can be
bound only to 0 or 1.


E.5  Partitioning  Database 

In this database, a fact of the form

(PARTITION <FACT-PATTERN>)

indicates that all facts that match the fact pattern <FACT-PATTERN>, and all rules
whose consequents (or heads) match the fact pattern, are included in one partition.
Variables here are all symbols that begin with “&”. The notation for variables
is different here because there are two types of variables in MRS: variables that
begin with “$” are base-level variables, and variables that begin with “&” are
meta-level variables. The base-level database describes the application of interest,
and the meta-level describes information about the base-level. The distinction is not
terribly important here, except that the partitioning database is better thought of
as containing meta-level information. The partitioning database is shown below.


(PARTITION (VALUE (PORT IN1 (PART (NUM FA 1) FOO)) &X))
(PARTITION (VALUE (PORT IN1 (PART (NUM FA 2) FOO)) &X))
(PARTITION (VALUE (PORT IN1 (PART (NUM FA 3) FOO)) &X))
(PARTITION (VALUE (PORT IN1 (PART (NUM FA 4) FOO)) &X))
(PARTITION (VALUE (PORT IN2 (PART (NUM FA 1) FOO)) &X))
(PARTITION (VALUE (PORT IN2 (PART (NUM FA 2) FOO)) &X))
(PARTITION (VALUE (PORT IN2 (PART (NUM FA 3) FOO)) &X))
(PARTITION (VALUE (PORT IN2 (PART (NUM FA 4) FOO)) &X))
(PARTITION (VALUE (PORT CIN (PART (NUM FA 1) FOO)) &X))
(PARTITION (VALUE (PORT COUT (PART (NUM FA 1) FOO)) &X))
(PARTITION (VALUE (PORT COUT (PART (NUM FA 2) FOO)) &X))
(PARTITION (VALUE (PORT COUT (PART (NUM FA 3) FOO)) &X))
(PARTITION (VALUE (PORT COUT (PART (NUM FA 4) FOO)) &X))
(PARTITION (VALUE (PORT SUM (PART (NUM FA 1) FOO)) &X))
(PARTITION (VALUE (PORT SUM (PART (NUM FA 2) FOO)) &X))
(PARTITION (VALUE (PORT SUM (PART (NUM FA 3) FOO)) &X))
(PARTITION (VALUE (PORT SUM (PART (NUM FA 4) FOO)) &X))
(PARTITION (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT CIN (PART (NUM FA 2) FOO)) &X))
(PARTITION (VALUE (PORT CIN (PART (NUM FA 3) FOO)) &X))
(PARTITION (VALUE (PORT CIN (PART (NUM FA 4) FOO)) &X))
(PARTITION (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND2 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART OR1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND2 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART OR1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND2 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART OR1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND2 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART AND1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART OR1 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 4) FOO))) &X))
(PARTITION (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4) FOO))) &X))
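
Whether a fact belongs to a partition is decided by matching it against the partition's fact pattern. The Python sketch below shows one way such a match could work. It is a one-way matcher written for illustration, not the MRS implementation; the function names are hypothetical, and treating symbols that start with `&` as pattern variables is an assumption about the notation.

```python
def parse(text):
    """Parse one s-expression into nested Python lists of string atoms."""
    tokens = text.replace("(", " ( ").replace(")", " ) ").split()

    def read(pos):
        if tokens[pos] == "(":
            items = []
            pos += 1
            while tokens[pos] != ")":
                item, pos = read(pos)
                items.append(item)
            return items, pos + 1  # skip the closing parenthesis
        return tokens[pos], pos + 1

    expr, _ = read(0)
    return expr


def match(pattern, fact, bindings=None):
    """Return variable bindings if `fact` matches `pattern`, otherwise None."""
    if bindings is None:
        bindings = {}
    if isinstance(pattern, str):
        if pattern.startswith("&"):  # pattern variable (assumed notation)
            if pattern in bindings and bindings[pattern] != fact:
                return None
            return {**bindings, pattern: fact}
        return bindings if pattern == fact else None
    if not isinstance(fact, list) or len(pattern) != len(fact):
        return None
    for sub_pattern, sub_fact in zip(pattern, fact):
        bindings = match(sub_pattern, sub_fact, bindings)
        if bindings is None:
            return None
    return bindings
```

A fact such as `(VALUE (PORT COUT (PART (NUM FA 4) FOO)) 1)` matches the pattern for the COUT port of the fourth full-adder and binds the variable to `1`; the same fact fails against the pattern for the third full-adder.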


E.6  Multiprocessor  Characteristics 


E.6.1  Size  of  Multiprocessor 

The number of processors used in the experiment was 61. This corresponds to
an E-size of 5 (i.e., there are 5 processors on each side of the hexagonal surface). The
processors, along with their processor addresses, are shown in Figure 42. Wrap-around
connections from the edges of the boundary are not shown in the figure for
the sake of simplicity.

Figure 42: E-5 Processing Surface for FAIM-1
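
The relation between E-size and processor count can be checked with the closed form for a hexagonal arrangement with n processors per side, 3n(n - 1) + 1. The short Python sketch below applies it; the function name is hypothetical.

```python
def processors_on_hex_surface(e_size):
    """Number of processors on a hexagonal surface with `e_size` processors per side."""
    if e_size < 1:
        raise ValueError("E-size must be at least 1")
    # One centre processor plus rings of 6, 12, ..., 6 * (e_size - 1) processors.
    return 3 * e_size * (e_size - 1) + 1
```

For an E-size of 5 this gives the 61 processors used in the experiment.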


E.6.2  Processing  Parameters 

For the cost model described in chapter 3, the only constant given a non-zero value
is Ku, the time taken to perform a successful unification. This constant is given
the value 50 microseconds, based on an estimate of 20 KLIPS for each processor.

E.6.3  Communication  Parameters

The constants K1, K2, and K3 used in the definition of the communication cost
function (see chapter 3) have the values given below.

K1 = 2 microseconds

K2 = 2 microseconds/packet

K3 = 1 microsecond/hop

All  messages  are  assumed  to  fit  in  one  packet. 
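
As a rough illustration of how these parameters might be used, the Python sketch below converts the 20 KLIPS estimate into the 50 microsecond unification time and evaluates a per-message communication cost. The linear form K1 + K2 * packets + K3 * hops is an assumption suggested by the units of the constants; the actual cost function is the one defined in chapter 3. All function and constant names are hypothetical.

```python
# Values quoted in sections E.6.2 and E.6.3.
K1_US = 2.0             # microseconds
K2_US_PER_PACKET = 2.0  # microseconds per packet
K3_US_PER_HOP = 1.0     # microseconds per hop


def unification_time_us(klips):
    """Time per inference in microseconds for a processor rated at `klips` KLIPS."""
    inferences_per_second = klips * 1000.0
    return 1_000_000.0 / inferences_per_second


def communication_cost_us(hops, packets=1):
    """ASSUMED linear cost form; see chapter 3 for the actual definition."""
    return K1_US + K2_US_PER_PACKET * packets + K3_US_PER_HOP * hops
```

Under this assumed form, a one-packet message travelling three hops would cost 2 + 2 + 3 = 7 microseconds.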


E.7  Allocation  Database

E.7.1  Allocation  Database  for  Single  Copy  Case 

In this database, facts of the form

(LOC <FACT-PATTERN> <PROCESSOR-ADDRESS>)

are intended to mean that the partition specified by

(PARTITION <FACT-PATTERN>)

should be allocated to the processor with the address <PROCESSOR-ADDRESS>.
The database is shown below.


(LOC (VALUE (PORT IN1 (PART (NUM FA 1.) FOO)) &X) (3. T.))
(LOC (VALUE (PORT IN1 (PART (NUM FA 2.) FOO)) &X) (2. 4.))
(LOC (VALUE (PORT IN1 (PART (NUM FA 3.) FOO)) &X) (2. «.))
(LOC (VALUE (PORT IN1 (PART (NUM FA 4.) FOO)) &X) (S. 4.))
(LOC (VALUE (PORT IN2 (PART (NUM FA 1.) FOO)) &X) (#. 2.))
(LOC (VALUE (PORT IN2 (PART (NUM FA 2.) FOO)) &X) (S. «.))
(LOC (VALUE (PORT IN2 (PART (NUM FA 3.) FOO)) &X) (3. 2.))
(LOC (VALUE (PORT IN2 (PART (NUM FA 4.) FOO)) &X) (1. 2.))
(LOC (VALUE (PORT CIN (PART (NUM FA 1.) FOO)) &X) (3. 6.))
(LOC (VALUE (PORT COUT (PART (NUM FA 1.) FOO)) &X) (8. 8.))
(LOC (VALUE (PORT COUT (PART (NUM FA 2.) FOO)) &X) (6. 2.))
(LOC (VALUE (PORT COUT (PART (NUM FA 3.) FOO)) &X) (0. 0.))
(LOC (VALUE (PORT COUT (PART (NUM FA 4.) FOO)) &X) (4. 4.))
(LOC (VALUE (PORT SUM (PART (NUM FA 1.) FOO)) &X) NIL)
(LOC (VALUE (PORT SUM (PART (NUM FA 2.) FOO)) &X) NIL)
(LOC (VALUE (PORT SUM (PART (NUM FA 3.) FOO)) &X) NIL)
(LOC (VALUE (PORT SUM (PART (NUM FA 4.) FOO)) &X) NIL)
(LOC (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 1.) FOO))) &X) (3. S.))
(LOC (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 2.) FOO))) &X) (6. 7.))
(LOC (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 3.) FOO))) &X) (4. 1.))
(LOC (VALUE (PORT IN1 (PART AND2 (PART (NUM FA 4.) FOO))) &X) (7. 4.))
(LOC (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 1.) FOO))) &X) NIL)
(LOC (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 2.) FOO))) &X) NIL)
(LOC (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 3.) FOO))) &X) NIL)
(LOC (VALUE (PORT IN2 (PART XOR2 (PART (NUM FA 4.) FOO))) &X) NIL)
(LOC (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 1.) FOO))) &X) (6. 2.))
(LOC (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 2.) FOO))) &X) (4. S.))
(LOC (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 3.) FOO))) &X) (3. 1.))
(LOC (VALUE (PORT IN2 (PART AND1 (PART (NUM FA 4.) FOO))) &X) (4. 4.))
(LOC (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 1.) FOO))) &X) (6. 2.))
(LOC (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 2.) FOO))) &X) (7. 0.))
(LOC (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 3.) FOO))) &X) (5. 1.))
(LOC (VALUE (PORT IN2 (PART XOR1 (PART (NUM FA 4.) FOO))) &X) (0. 2.))
(LOC (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 1.) FOO))) &X) (2. «.))
(LOC (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 2.) FOO))) &X) (3. 5.))
(LOC (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 3.) FOO))) &X) (4. 2.))
(LOC (VALUE (PORT IN1 (PART AND1 (PART (NUM FA 4.) FOO))) &X) (6. 4.))
(LOC (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 1.) FOO))) &X) (3. 6.))
(LOC (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 2.) FOO))) &X) (B. 5.))
(LOC (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 3.) FOO))) &X) (2. 5.))
(LOC (VALUE (PORT IN1 (PART XOR1 (PART (NUM FA 4.) FOO))) &X) (0. 3.))
(LOC (VALUE (PORT CIN (PART (NUM FA 2.) FOO)) &X) (7. 7.))
(LOC (VALUE (PORT CIN (PART (NUM FA 3.) FOO)) &X) (6. 2.))
(LOC (VALUE (PORT CIN (PART (NUM FA 4.) FOO)) &X) (8. 4.))
(LOC (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 1.) FOO))) &X) (1. 5.))
(LOC (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 1.) FOO))) &X) (4. 0.))
(LOC (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 1.) FOO))) &X) (2. 5.))
(LOC (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 1.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART AND2 (PART (NUM FA 1.) FOO))) &X) (2. S.))
(LOC (VALUE (PORT OUT (PART AND1 (PART (NUM FA 1.) FOO))) &X) (5. 1.))
(LOC (VALUE (PORT OUT (PART OR1 (PART (NUM FA 1.) FOO))) &X) (4. 0.))
(LOC (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 1.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 1.) FOO))) &X) (3. 6.))
(LOC (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 2.) FOO))) &X) (4. 7.))
(LOC (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 2.) FOO))) &X) (3. 6.))
(LOC (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 2.) FOO))) &X) (S. 6.))
(LOC (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 2.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART AND2 (PART (NUM FA 2.) FOO))) &X) (5. 7.))
(LOC (VALUE (PORT OUT (PART AND1 (PART (NUM FA 2.) FOO))) &X) (4. 6.))
(LOC (VALUE (PORT OUT (PART OR1 (PART (NUM FA 2.) FOO))) &X) (3. 7.))
(LOC (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 2.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 2.) FOO))) &X) («. 6.))
(LOC (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 3.) FOO))) &X) (2. 0.))
(LOC (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 3.) FOO))) &X) (2. 1.))
(LOC (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 3.) FOO))) &X) (4. 0.))
(LOC (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 3.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART AND2 (PART (NUM FA 3.) FOO))) &X) (3. 0.))
(LOC (VALUE (PORT OUT (PART AND1 (PART (NUM FA 3.) FOO))) &X) (3. 2.))
(LOC (VALUE (PORT OUT (PART OR1 (PART (NUM FA 3.) FOO))) &X) (1. 0.))
(LOC (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 3.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 3.) FOO))) &X) (1. S.))
(LOC (VALUE (PORT IN1 (PART OR1 (PART (NUM FA 4.) FOO))) &X) (E. 4.))
(LOC (VALUE (PORT IN2 (PART OR1 (PART (NUM FA 4.) FOO))) &X) (4. 4.))
(LOC (VALUE (PORT IN2 (PART AND2 (PART (NUM FA 4.) FOO))) &X) (7. S.))
(LOC (VALUE (PORT IN1 (PART XOR2 (PART (NUM FA 4.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART AND2 (PART (NUM FA 4.) FOO))) &X) (8. 4.))
(LOC (VALUE (PORT OUT (PART AND1 (PART (NUM FA 4.) FOO))) &X) (4. 4.))
(LOC (VALUE (PORT OUT (PART OR1 (PART (NUM FA 4.) FOO))) &X) (4. 4.))
(LOC (VALUE (PORT OUT (PART XOR2 (PART (NUM FA 4.) FOO))) &X) NIL)
(LOC (VALUE (PORT OUT (PART XOR1 (PART (NUM FA 4.) FOO))) &X) (8. 6.))


E.7.2  Allocation  Database  for  Multiple  Copy  Case 

In this database, facts of the form

(LOC <FACT-PATTERN> <CLUSTER>)

are intended to mean that the partition specified by

(PARTITION <FACT-PATTERN>)

should be allocated to the cluster of processors <CLUSTER>. A cluster specified
as (<n> <PA1> (<PA1> <PA2> ... <PAn>)) denotes <n> processors with the
addresses <PA1>, <PA2>, ..., <PAn>. <PA1> is the central processor in this cluster
of processors. A cluster specified by nil indicates that no processors belong to the
cluster (i.e., the related partition is not allocated to any processor). This is reasonable
if the partition is not used at all for proving the goals specified at compile-time.
The location database is given below.
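
To make the cluster notation concrete, the Python sketch below builds a seven-processor cluster from a central address. The neighbour offsets are inferred from the seven-processor entries that are legible in the location database (for example the cluster centred on (4. 3.)), so they are an assumption about the addressing scheme rather than a definition taken from the text. Wrap-around at the edges of the surface is not modelled, and the names are hypothetical.

```python
# Offsets of the six neighbours, in the order they appear in the legible entries
# (assumed; inferred from the listing, not stated in the text).
NEIGHBOR_OFFSETS = [(0, -1), (-1, -1), (-1, 0), (0, 1), (1, 1), (1, 0)]


def seven_cluster(center):
    """Return (n, central_address, member_addresses) for a 7-processor cluster."""
    x, y = center
    members = [(x + dx, y + dy) for dx, dy in NEIGHBOR_OFFSETS]
    members.append(center)  # the central processor is listed last in the entries
    return (len(members), center, members)
```

For the centre (4, 3) this reproduces the member list (4 2) (3 2) (3 3) (4 4) (5 4) (5 3) (4 3).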


(LOC  (VALUE  (PORT  HI  (PART  (TOE  FA  1.)  F00))  AX)  (19.  (4.  1.)  (ft-  *•>  (0.  4.) 
(2.  1.)  (3.  2.)  (4.  3.)  (S.  3.)  (6.  3.)  (6.  2.)  (2.  6.)  (4.  0.)  (3.  0.)  (3.  1.) 


(8.  8.)  (7.  8.)  (2.  0.) 
(4.  2.)  (5.  2.)  (S.  1.) 


(4.  1.)))) 

(LOC  (VALUE  (PORT  III  (PART  (IUI!  FA  2.)  F00>)  AX)  (7. 
3.)  (2.  3.)))) 

(LOC  (VALUE  (PORT  III  (PART  (IUH  FA  3.)  FOO))  AX)  (7. 


(2.  3.)  ((2.  2.)  (1.  2.)  (1.  3.)  <2.  4.)  (3.  4.)  (3. 
(2.  4.)  ((2.  3.)  (1.  3.)  (1.  4.)  (2.  6.)  (3.  6.)  (3. 


4.)  (2.  4.)))) 

(LOC  (VALUE  (PORT  III  (PART  (IUH  FA  4.)  FOO))  AX)  (7.  (4.  3.)  ((4.  2.)  (3.  2.)  (3.  3.)  (4.  4.)  (5.  4.)  (5 


3.)  (4.  3.)))) 

(LOC  (VALUE  (PORT  H2  (PART  (IUH  FA  1.)  FOO))  AX)  (19.  (0.  0.)  ((8.  8.)  (5.  7.) 
(7.  4.)  (8.  B.)  (0.  2.)  <1.  2.)  (2.  2.)  (2.  1.)  (2.  0.)  (B.  8.)  (4.  8.)  (8.  4.) 

(0.  0.)))) 

(LOC  (VALUE  (PORT  112  (PART  (IUH  FA  2.)  FOO))  AX)  (7.  (4.  2.)  ((4.  1.)  (3.  1.) 
2.)  (4.  2.)))) 

(LOC  (VALUE  (PORT  X12  (PART  (TOE  FA  3.)  FOO))  AX)  (7.  (7.  7.)  ((7.  6.)  (6.  6.) 


(4.  7.)  (3.  7.)  (7.  3.) 


(0.  1.)  (1.  1.)  (1.  0.) 


(3.  2.)  (4.  3.)  (B.  3.)  (5. 
(8.  7.)  (7.  8.)  (8.  8.)  (8. 


7.)  (7.  7.)))) 

(LOC  (VALUE  (PORT  IE2  (PART  (TO*  FA  4.)  FOO))  AX)  (7.  (3.  B.)  ((3.  4.)  (2.  4.)  (2.  B.)  (3.  6.)  (4.  8.) 


(4. 


B.)  (3.  6.)))) 

(LOC  (VALUB  (PORT  CII  (PART  (TOE  FA  1.)  FOO))  AX)  (19.  (8.  4.)  ((6.  8.)  (4.  7.)  (3.  7.) 
(8.  4.)  (7.  B.)  (8.  6.)  (0.  2.)  (1.  2.)  (1.  1.)  Cl.  0.)  (4.  8.)  (7.  3.)  (7.  4.)  (8.  B.) 


(6.  2.)  (8.  3.) 
(0.  1.)  (0.  0.) 


(8.  4.)))) 

(LOC  (VALUE  (PORT  COUT  (PART  (TO*  FA  1.)  FOO))  AX)  (7.  (7.  7.)  ((7.  6.)  (6.  6.) 
(8.  7.)  (7.  7.)))) 

(LOC  (VALUB  (PORT  COUT  (PART  (TO*  FA  2.)  FOO))  AX)  (7.  (B.  4.)  ((B.  3.)  (4.  3.) 


(8.  7.)  (7.  8.)  (8.  8.) 
(4.  4.)  (B.  B.)  (8.  6.) 


(8.  4.)  (6.  4.)))) 

(LOC  (VALUE  (PORT  COUT  (PART  (TO*  FA  3.)  FOO))  AX)  (1.  (B.  6.)  ((B.  B.)))) 

(LOC  (VALUE  (PORT  COUT  (PART  (TO*  FA  4.)  FOO))  AX)  (1.  (4.  4.)  ((4.  4.)))) 

(LOC  (VALUE  (PORT  SU*  (PART  (TO*  FA  1.)  FOO))  AX)  IIL) 

(LOC  (VALUE  (PORT  SU*  (PART  (TO*  FA  2.)  FOO))  AX)  IIL) 

(LOC  (VALUE  (PORT  SU*  (PART  CEU*  FA  3.)  FOO))  AI)  IIL)
(LOC  (VALUE  (POET  SOT  (PAET  (EOT  Fi  4.)  FOO))  *X)  BIL)

(LOC  (VALUE  (POET  HI  (PIET  AIDS  (PAET  (EOT  FA  1.)  FOO)))  »I)  (19.  (8.  4.) 
2.)  (8.  3.)  (6.  4.)  (7.  S.)  (8.  6.)  (0.  2.)  (1.  3.)  (1.  1.)  Cl.  0.)  (4.  8. 

1.)  (0.  0.)  (8.  4.)))) 

(LOC  (VALUE  (POET  IB1  (PAET  AE02  (PAET  (EOT  FA  2.)  FOO)))  EX)  (7.  (S.  6.) 

(6.  e.)  (6.  8.)  (5.  8.)))) 

(LOC  (VALUE  (POET  HI  (PAET  AED2  (PAET  (EOT  FA  3.)  FOO)))  AX)  (7.  (2.  4.) 
(3.  8.)  (3.  4.)  (2.  4.)))) 

(LOC  (VALUE  (POET  HI  (PAET  AED2  (PAET  (EOT  FA  4.)  FOO)))  »X)  (1.  (4.  8.) 

(LOC  (VALUE  (POET  H2  (PAET  X0E2  (PAET  (EOT  FA  1.)  FOO)))  »X)  BXL) 

(LOC  (VALUE  (POET  H2  (PAET  X0E2  (PAET  (EOT  FA  2.)  FOO)))  *X)  BIL) 

(LOC  (VALUE  (POET  H2  (PAET  I0E2  (PAET  (EOT  FA  3.)  FOO)))  *1)  BIL) 

(LOC  (VALUE  (POET  H2  (PAET  X0E2  (PAET  (EOT  FA  4.)  FOO)))  *X)  BIL) 

(LOC  (VALUE  (POET  H2  (PAET  ABD1  (PAET  (EOT  FA  1.)  FOO)))  81)  (7.  (0.  0.) 

(1.  1.)  (1.  0.)  (0.  0.)))) 

(LOC  (VALUE  (POET  H2  (PAET  AED1  (PAET  (EOT  FA  2.)  FOO)))  8X)  (7.  (4.  2.) 
(8.  3.)  (8.  2.)  (4.  2.)))) 

(LOC  (VALUE  (POET  H2  (PAET  ABD1  (PAET  (EOT  FA  3.)  FOO)))  8X)  (1.  (3.  3.) 

(LOC  (VALUE  (POET  H2  (PAET  AB01  (PAET  (EOT  FA  4.)  FOO)))  8X)  (1.  (6.  4.) 

(LOC  (VALUE  (POET  H2  (PAET  XOEi  (PAET  (EOT  FA  1.)  FOO)))  81)  (19-  (4.  8. 

3.)  (2.  4.)  (2.  6.)  (3.  6.)  (4.  7.)  (8.  7.)  (6.  7.)  (8.  6.)  (6.  8.)  (4.  4 

8.)  (6.  8.)  (4.  8.)))) 

(LOC  (VALUE  (POET  IB2  (PAET  XOEI  (PAET  (EOT  FA  2.)  FOO)))  8X)  (7.  (1.  8.: 

(2.  6.)  (2.  6.)  (1.  8.)))) 

(LOC  (VALUE  (POET  H2  (PAET  XOEI  (PAET  (EOT  FA  3.)  FOO)))  81)  (7.  (7.  7.! 
(8.  8.)  (8.  7.)  (7.  7.)))) 

(LOC  (VALUE  (POET  H2  (PAET  XOEI  (PAET  (EOT  FA  4.)  FOO)))  8X)  (1.  «■  B‘ 
(LOC  (VALUE  (POET  HI  (PAET  AED1  (PAET  (EOT  FA  1.)  FOO)))  8X)  (19.  (4.  4 
2.)  (2.  3.)  (2.  4.)  (3.  8.)  (4.  8.)  (8.  8.)  (6.  6.)  (8.  6.)  (6.  4.)  (4. 

5.)  (8.  4.)  (4.  4.)))) 

(LOC  (VALUE  (POET  HI  (PAET  ABD1  (PAET  (EOT  FA  2.)  FOO)))  81)  (7.  (2.  3. 
(3.  4.)  (3.  3.)  (2.  3.)))) 

(LOC  (VALUE  (POET  HI  (PAET  ABD1  (PAET  (EOT  FA  3.)  FOO)))  81)  (7.  (2.  4. 
(3.  5.)  (3.  4.)  (3.  4.)))) 

(LOC  (VALUE  (POET  HI  (PAET  ABD1  (PAET  (EOT  FA  4.)  FOO)))  81)  (1.  (4.  3. 
(LOC  (VALUE  (POET  HI  (PAET  XOEI  (PAET  (EOT  FA  1.)  FOO)))  8X)  (19.  (4.  1 
8.)  (2.  0.)  (2.  1.)  (3.  2.)  (4.  3.)  (8.  3.)  (8.  3.)  (8.  2.)  (2.  8.)  (4. 

2.)  (5.  1.)  (4.  !.))»
((8.  8.)  (4.  7.)  (3.  7.)  (8.

)  (7.  3.)  (7.  4.)  (8.  8.)  (0. 

((8.  4.)  (4.  4.)  (4.  8.)  (6.  6.) 

((2.  3.)  (1.  3.)  (1.  4.)  (2.  8.) 

((4.  8.)))) 


((6.  8.)  (4.  8.)  (8.  4.)  (0.  1.) 

((4.  1.)  (3.  1.)  (3.  2.)  (4.  3.) 

((3.  3.)))) 

((8.  4.)))) 

)  ((8.  4.)  (4.  3.)  (3.  3.)  (2. 

1.)  (3.  4.)  (3.  6.)  (4.  6.)  (8. 

I  ((1.  4.)  (0.  4.)  (4.  0.)  (6.  1.) 

>  ((7.  6.)  (6.  6.)  (6.  7.)  (7.  8.) 

)  ((3.  8.)))) 

.)  ((6.  3.)  (4.  2.)  (3.  2.)  (2. 

3.)  (3.  3.)  (3.  4.)  (4.  8.)  (8. 

)  ((2.  2.)  (1.  2.)  (1.  3.)  (2.  4.) 

)  ((2.  3.)  (1.  3.)  (1.  4.)  (2.  8.) 

.)  ((4.  3.)))) 

L.)  ((1.  6.)  (0.  4.)  (8.  8.)  (7. 

0.)  (3.  0.)  (3.  1.)  (4.  2.)  (8.
(LOC  (VALUE  (POET  HI  (PART  XOR1  (PART  (TOM  PA  2.)  rOO)))  *1)  (7.  (T.  4.)  ((7.  3.)  («.  3.)  (8.  4.)  (7.  S.)
(8.  S.)  (8.  4.)  (7.  4.)))) 

(LOC  (VALUE  (PORT  III  (PART  XOR1  (PART  (TOM  FA  3.)  POO)))  81)  (7.  (4.  8.)  ((4.  6.)  (3.  S.)  (3.  8.)  (4.  7.) 
(B.  7.)  (6.  6.)  (4.  6.)))) 

(LOC  (VALUE  (PORT  Ill  (PART  XOR1  (PART  (TOM  PA  4.)  POO)))  RX)  (1.  (»•  3.)  ((2.  3.)))) 

(LOC  (VALUE  (PORT  CXI  (PART  (TOM  PA  3.)  POO))  8X)  (7.  (8.  7.)  ((8.  6.)  (7.  8.)  (7.  7.)  (8.  8.)  (0.  4.)  (0. 

3. )  (8.  7.)))) 

(LOC  (VALUE  (PORT  CXI  (PART  (TOM  PA  3.)  POO))  RX)  (7.  (2.  4.)  ((2.  3.)  (1.  3.)  (1.  4.)  (2.  6.)  (3.  S.)  (3. 

4. )  (2.  4.)))) 

(LOC  (VALUE  (PORT  CXI  (PART  (TOM  FA  4.)  POO))  RX)  (1.  (5.  ••)  ((B*  ••)))) 

(LOC  (VALUE  (PORT  III  (PART  0R1  (PART  (TOM  PA  1.)  POO)))  AI)  (19.  (*■  6.)  ((6.  4.)  (4.  3.)  (3.  3.)  (2.  3.) 

(2.  4.)  (2.  B.)  (3.  6.)  (4.  7.)  (B.  7.)  (8.  7.)  (8.  8.)  (8.  S.)  (4.  4.)  (3.  4.)  (3.  B.)  (4.  6.)  (B.  6.) 

(B.  B.)  (4.  B.)») 

(LOC  (VALUE  (PORT  112  (PART  0R1  (PART  (TOM  FA  1.)  POO)))  RX)  (19.  (0.  0.)  ((8.  8.)  (B.  7.)  (4.  7.)  (3.  7.) 
(7.  3.)  (7.  4.)  (8.  B.)  (0.  2.)  (1.  2.)  (2.  2.)  (2.  1.)  (2.  0.)  (B.  8.)  (4.  8.)  (8.  4.)  (0.  1.)  (1.  l-> 

(1.  0.)  (0.  0.)))) 

(LOC  (VALUE  (PORT  112  (PART  AID2  (PART  (TOM  PA  1.)  POO)))  RX)  (19.  (4.  0.)  ((1.  4.)  (0.  3.)  (8.  7.)  (7. 

7.)  (7.  8.)  (2.  0.)  (3.  1.)  (4.  2.)  (B.  2.)  (8.  2.)  (2.  6.)  (3.  B.)  (0.  4.)  (8.  8.)  (3.  0.)  (4.  1.)  (S. 

1.)  (1.  8.)  (4.  0.)))) 

(LOC  (VALUE  (PORT  HI  (PART  X0R2  (PART  (TOM  PA  1.)  POO)))  RX)  IXL) 

(LOC  (VALUE  (PORT  OUT  (PART  AID2  (PART  (TOM  PA  1.)  POO)))  RX)  (19.  (4.  0.)  ((1.  4.)  (0.  3.)  (8.  7.)  (7. 

7.)  (7.  8.)  (2.  0.)  (3.  1.)  (4.  2.)  (B.  2.)  (6.  2.)  (2.  8.)  (2.  B.)  (0.  4.)  (8.  8.)  (3.  0.)  (4.  1.)  (5. 

1.)  (1.  B.)  (4.  0.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  AID1  (PART  (TOM  PA  1.)  POO)))  RX)  (19.  (0.  0.)  ((8.  8.)  (B.  7.)  (4.  7.)  (3. 

7. )  (7.  3.)  (7.  4.)  (8.  B.)  (0.  2.)  (1.  2.)  (2.  2.)  (2.  1.)  (2.  0.)  (B.  8.)  (4.  8.)  (8.  4.)  (0.  1.)  (1. 

1.)  (1.  0.)  (0.  0.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  0R1  (PART  (TOM  PA  1.)  POO)))  RX)  (19.  (4.  B.)  ((S.  4.)  (4.  3.)  (3.  3.)  (2.  3.) 
(2.  4.)  (2.  B.)  (3.  6.)  (4.  7.)  (B.  7.)  (8.  7.)  (8.  8.)  (8.  B.)  (4.  4.)  (3.  4.)  (3.  B.)  (4.  6.)  (B.  8.) 

(6.  6.)  (4.  6.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  X0R2  (PART  (TOM  PA  1.)  POO)))  RX)  1IL) 

(LOC  (VALUE  (PORT  OUT  (PART  X0R1  (PART  (TOM  PA  1.)  POO)))  RX)  (19.  (4.  B.)  ((8.  4.)  (4.  3.)  (3.  3.)  (2. 

3.)  (2.  4.)  (2.  S.)  (3.  8.)  (4.  7.)  (6.  7.)  (6.  7.)  (8.  8.)  (8.  B.)  (4.  4.)  (3.  4.)  (3.  B.)  (4.  6.)  (S. 

8. )  (S.  6.)  (4.  6.)))) 

(LOC  (VALUE  (PORT  III  (PART  0R1  (PART  (TOM  PA  2.)  POO)))  RX)  (7.  (4.  2.)  ((4.  1.)  (3.  1.)  (3.  2.)  (4.  3.) 
(6.  3.)  (6.  2.)  (4.  2.)))) 

(LOC  (VALUE  (PORT  112  (PART  0R1  (PART  (TOM  PA  2.)  POO)))  RX)  (7.  (2.  3.)  ((2.  2.)  (1.  2.)  (1.  3.)  (2.  4.) 
(3.  4.)  (3.  3.)  (2.  3.))))
(LOC  (VALUE  (PORT  II2  (PART  AID2  (PART  (BUM  FA  2.)  FOO)))  *1)  (7.  (4.  7.)  ((4.  8.)  (3.  6.)  (3.  7
(5.  8.)  (5.  7.)  (4.  7.)))) 

(LOC  (VALUE  (PORT  III  (PART  I0R2  (PART  (IUM  FA  2.)  FOO)))  AX)  in.) 

(LOC  (VALUE  (PORT  OUT  (PART  A1D2  (PART  (EUR  FA  2.)  FOO)))  AI)  (7.  (7.  4.)  ((7.  3.)  (8.  3.)  (6.  4 

(8.  S.)  (8.  4.)  (7.  4.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  AID1  (PART  (EUH  FA  2.)  FOO)))  AX)  (7.  (2.  3.)  ((2.  2.)  (1.  2.)  (1.  3 
(3.  4.)  (3.  3.)  (2.  3.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  OR1  (PART  (IUM  FA  2.)  FOO)))  AX)  (7.  (4.  2.)  ((4.  1.)  (3.  1.)  (3.  2.. 
(5.  3.)  (S.  2.)  (4.  2.)))) 

(LOC  (VALUB  (PORT  OUT  (PART  XOR2  (PART  (IUM  FA  2.)  FOO)))  AI)  IIL) 

(LOC  (VALUE  (PORT  OUT  (PART  IOR1  (PART  (IUM  FA  2.)  FOO)))  AX)  (7.  (1.  0.)  ((6.  8.)  (S.  8.)  (0.  0 

(2.  1.)  (2.  0.)  (1.  0.)))) 

(LOC  (VALUE  (PORT  III  (PART  0R1  (PART  (IUM  FA  3.)  FOO)))  AX)  (7.  (6.  6.)  ((6.  4.)  (4.  4.)  (4.  5. 

(6.  6.)  (8.  6.)  (5.  S.)))) 

(LOC  (VALUE  (PORT  II2  (PART  0R1  (PART  (IUM  FA  3.)  FOO)))  AI)  (7.  (S.  S.)  ((6.  4.)  (4.  4.)  (4.  8. 
(6.  6.)  (8.  5.)  (6.  $.)))) 

(LOC  (VALUE  (PORT  112  (PART  AID2  (PART  (IUM  FA  3.)  FOO)))  AX)  (7.  (8.  8.)  ((8.  4.)  (4.  4.)  (4.  6 

(6.  8.)  (8.  8.)  (8.  6.)))) 

(LOC  (VALUE  (PORT  III  (PART  X0R2  (PART  (IUM  FA  3.)  FOO)))  AX)  IIL) 

(LOC  (VALUE  (PORT  OUT  (PART  AID2  (PART  (IUM  FA  3.)  FOO)))  AI)  (7.  (8.  8.)  ((8.  4.)  (4.  4.)  (4.  6 

(8.  8.)  (8.  8.)  (6.  8.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  AID1  (PART  (IUM  FA  3.)  FOO)))  AX)  (7.  (8.  6.)  ((8.  4.)  (4.  4.)  (4.  8 

(6.  6.)  (6.  8.)  (8.  6.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  0R1  (PART  (IUM  FA  3.)  FOO)))  AX)  (7.  (6.  8.)  ((6.  4.)  (4.  4.)  (4.  6. 

(6.  6.)  (8.  8.)  (8.  8.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  X0R2  (PART  (IUM  FA  3.)  FOO)))  AX)  IIL) 

(LOC  (VALUB  (PORT  OUT  (PART  X0R1  (PART  (IUM  FA  3.)  FOO)))  AX)  (7.  (8.  6.)  ((8.  4.)  (4.  4.)  (4.  8 

(8.  6.)  (6.  8.)  (8.  6.)))) 

(LOC  (VALUB  (PORT  III  (PART  0R1  (PART  (IUM  FA  4.)  FOO)))  AX)  (1.  (6.  4.)  ((8.  4.)))) 

(LOC  (VALUE  (PORT  II2  (PART  0R1  (PART  (IUM  FA  4.)  FOO)))  AX)  (1.  (4.  4.)  ((4.  4.)))) 

(LOC  (VALUE  (PORT  II2  (PART  AID2  (PART  (IUH  FA  4.)  FOO)))  AI)  (1.  (4.  4.)  ((4.  4.)))) 

(LOC  (VALUB  (PORT  III  (PART  I0R2  (PART  (BUM  FA  4.)  FOO)))  AX)  IIL) 

(LOC  (VALUE  (PORT  OUT  (PART  AID2  (PART  (IUH  FA  4.)  FOO)))  AX)  (1.  (8.  8.)  ((8.  8.)))) 

(LOC  (VALUB  (PORT  OUT  (PART  AIM  (PART  (BUM  FA  4.)  FOO)))  AX)  (1.  (8.  4.)  ((6.  4.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  0R1  (PART  (BUM  FA  4.)  FOO)))  AX)  (1.  (4.  4.)  ((4.  4.)))) 

(LOC  (VALUE  (PORT  OUT  (PART  I0R2  (PART  (BUM  FA  4.)  FOO)))  AI)  IIL) 

(LOC  (VALUE  (PORT  OUT  (PART  X0R1  (PART  (IUM  FA  4.)  FOO)))  AX)  (1.  (3.  4.)  ((3.  4.)))) 


.)  (4.  8.) 

.)  (7.  S.) 

.)  (2.  4.) 

)  (4.  3.) 

.)  (1.  1.) 

)  (8.  6.) 

)  (8.  6.) 

.)  (6.  8.) 

.)  (8.  8.) 

.)  (8.  6.) 

)  (6.  6.) 

.)  (8.  6.) 


Bibliography 


[1] G.A. Agha. Actors: A Model of Concurrent Computation in Distributed Systems.
Technical Report 844, MIT AI Laboratory, March 1985.

[2] A. V. Aho, M. R. Garey, and J. D. Ullman. The Transitive Reduction of a
Directed Graph. SIAM Journal on Computing, 1(2):131-137, June 1972.

[3] A. V. Aho, J. E. Hopcroft, and J. D. Ullman. Data Structures and Algorithms.
Addison-Wesley Publishing Company, 1983.

[4]  Bill  Athas.  Fine  Grain  Concurrent  Computations.  PhD  thesis,  Computer 
Science  Department,  California  Institute  of  Technology,  1987.  Also  published 
as  technical  report  5242:TR:87. 

[5]  J.  Backus.  Can  programming  be  liberated  from  the  von  Neumann  style?  A 
functional  style  and  its  algebra  of  programs.  Communications  of  the  ACM, 
21(8):613-641,  August  1978. 

[6] A. Barr and E. A. Feigenbaum, editors. The Handbook of Artificial Intelligence,
chapter 12, page 80. Volume 3, William Kaufmann, Inc., Los Altos, California,
1982.

[7] A. Barr and E. A. Feigenbaum, editors. The Handbook of Artificial Intelligence,
chapter 2, page 39. Volume 1, William Kaufmann, Inc., Los Altos, California,
1981.




[8] L. Bic. A Data-Driven Model for Parallel Interpretation of Logic Programs.
In Proceedings of the International Conference on Fifth Generation Computer
Systems, pages 517-523, ICOT, 1984.

[9] P. Borgwardt. Parallel Prolog using Stack Segments on Shared-Memory
Multiprocessors. In IEEE Logic Programming Conference, pages 2-11, IEEE,
February 1984.

[10] Michael L. Campbell. Static Allocation for a Data Flow Multiprocessor.
In Proceedings of the 1985 International Conference on Parallel Processing,
pages 511-517, IEEE, 1985.

[11] Martha J. Chamberlain and A.L. Davis. A Static Resource Allocation
Methodology for a Dataflow Multiprocessor. Copies may be obtained from A.L. Davis
at Hewlett Packard Labs, 1501 Page Mill Rd., Bldg. 3U, Palo Alto, CA 94304.

[12] A. Ciepielewski and S. Haridi. A Formal Model for Or-Parallel Execution of
Logic Programs. In Proceedings of the IFIP Congress, pages 299-305, IFIP, 1983.

[13] Wayne Citrin. Parallel Unification Scheduling in Prolog. 1985. Can be
obtained from the Aquarius group, 512 Evans Hall, Berkeley, CA 94720.

[14] K. L. Clark and S. Gregory. A Relational Language for Parallel Programming.
In Proceedings of the Conference on Functional Programming and Computer
Architecture, pages 171-178, Association for Computing Machinery, October 1981.

[15] John S. Conery. The And/Or Process Model for Parallel Interpretation of Logic
Programs. PhD thesis, University of California, Irvine, 1983.

[16] John S. Conery and Dennis F. Kibler. AND Parallelism in Logic Programs.
In Proceedings of the International Joint Conference on Artificial Intelligence,
pages 539-543, 1983.




[17] John S. Conery and Dennis F. Kibler. Parallel Interpretation of Logic
Programs. In Proceedings of the Conference on Functional Programming and
Computer Architecture, pages 163-170, Association for Computing Machinery,
October 1981.

[18] A. L. Davis and S. V. Robison. The Architecture of the FAIM-1 Symbolic
Multiprocessing System. In Proceedings of IJCAI-85, Morgan Kaufmann
Publishers, Inc., August 1985.

[19] D. DeGroot. Restricted And-Parallelism. In Proceedings of the International
Conference on Fifth Generation Computer Systems, pages 471-478, ICOT,
Japan, 1984.

[20] C. Dwork, P. C. Kanellakis, and J. C. Mitchell. On the Sequential Nature of
Unification. The Journal of Logic Programming, 1(1):35-50, June 1984.

[21] J. Finger and M. Genesereth. Residue: A Deductive Approach to Design
Synthesis. Technical Report HPP-85-1, Heuristic Programming Project, Computer
Science Department, Stanford University, January 1985.

[22] M. J. Flynn. Very High-Speed Computing Systems. Proceedings of IEEE,
54:1901-1909, December 1966.

[23] Charles L. Forgy. OPS5 User’s Manual. Technical Report CMU-CS-81-135,
Computer Science Department, Carnegie Mellon University, 1981.

[24] Gordon Foyster. Helios: User’s Manual. Technical Report HPP-84-34,
Heuristic Programming Project, Computer Science Department, Stanford University,
August 1984.

[25] K. Furukawa, K. Nitta, and Y. Matsumoto. Prolog Interpreter Based on
Concurrent Programming. In Proceedings of the First International Logic
Programming Conference, pages 38-44, 1982.




[26] R.P. Gabriel and J. McCarthy. Queue-based Multi-processing Lisp. In 1984
ACM Symposium on Lisp and Functional Programming, pages 25-44, August 1984.

[27] Michael R. Garey and David S. Johnson. Computers and Intractability: A
Guide to the Theory of NP-Completeness. W. H. Freeman and Company, San
Francisco, 1979.

[28] M. R. Genesereth. The Use of Design Descriptions in Automated Diagnosis.
Technical Report HPP-81-20, Heuristic Programming Project, Computer
Science Department, Stanford University, 1984.

[29] R. L. Graham. Bounds on Multiprocessing Timing Anomalies. SIAM Journal
of Applied Mathematics, 17(2):416-429, March 1969.

[30] S. Gregory. Design, Application and Implementation of a Parallel Logic
Programming Language. PhD thesis, Imperial College of Science and Technology,
1985.

[31] R. Halstead. Multilisp: A Language for Concurrent Symbolic Computation.
ACM Transactions on Programming Languages and Systems, 7(4):501-538,
October 1985.

[32] R. Halstead. Parallel symbolic computing. IEEE Computer, 19(8):35-43,
August 1986.

[33] Haridi, Seif and Ciepielewski, Andrzej. An Or-Parallel Token Machine.
Technical Report TRITA-CS-8303, Department of Telecommunication Systems -
Computer Systems, The Royal Institute of Technology, Sweden, May 1983.

[34] M.V. Hermenegildo. Relating Goal Scheduling, Precedence, and Memory
Management in AND-parallel Execution of Logic Programs. In Proceedings of the
Fourth International Conference on Logic Programming, 1987.

[35] Hillis, W. Daniel. The Connection Machine. A.I. Memo 646, Artificial
Intelligence Laboratory, M.I.T., September 1981. Revised report under preparation.




[36] D.A. Hornig. Automatic Partitioning and Scheduling on a Network of Personal
Computers. PhD thesis, Department of Computer Science, Carnegie Mellon
University, 1984. Also available as technical report CMU-CS-84-165.

[37] B. J. Lageweg, E. L. Lawler, J. K. Lenstra, and A. H. G. Rinnooy Kan.
Computer Aided Complexity Classification of Deterministic Scheduling Problems.
Technical Report BW 138/81, Mathematisch Centrum, Amsterdam, 1981.

[38]  B.  W.  Lampson,  editor.  Distributed  Systems.  Lecture  Notes  in  Computer 
Science,  Springer-Verlag,  1981. 

[39] E. L. Lawler, J. K. Lenstra, and A. H. G. Rinnooy Kan. Recent Developments in
Deterministic Scheduling. Technical Report BW 146/81, Mathematisch Centrum,
Amsterdam, 1982.

[40]  P.P.  Li.  A  Parallel  Execution  Model  for  Logic  Programming.  PhD  thesis, 
Computer  Science  Department,  California  Institute  of  Technology,  1986.  Also 
published  as  technical  report  5227:TR:86. 

[41]  G.  Lindstrom  and  P.  Panangaden.  Stream-Based  Execution  of  Logic  Programs. 
In  IEEE  Logic  Programming  Conference,  pages  168-176,  IEEE,  February  1984. 

[42]  Malone,  T.  W.,  R.  E.  Fikes,  and  M.  T.  Howard.  Enterprise:  A  Market-like  Task 
Scheduler  for  Distributed  Computing  Environments.  Working  Paper,  Cognitive 
and  Instructional  Sciences  Group,  Xerox  Palo  Alto  Research  Center,  Palo  Alto, 
California,  October  1983. 

[43]  E.  W.  Mayr.  Well  Structured  Parallel  Programs  Are  Not  Easier  to  Schedule. 
Report  No.  STAN-CS-81-880,  Stanford  University,  September  1981. 

[44] Moon, David A. Architecture of the Symbolics 3600. In The Proceedings of the
12th Annual International Symposium on Computer Architecture, pages 76-83,
1985.

[45] T. Moto-oka and H. S. Stone. Fifth Generation Computer Systems: A Japanese
Project. Computer, 17(3):6-13, March 1984.




[46]  Multimax  Technical  Summary.  Encore  Computer  Corporation,  257  Cedar  Hill 
Street,  Marlborough,  MA  01752. 

[47] B.J. Nelson. Remote Procedure Call. PhD thesis, Department of Computer
Science, Carnegie Mellon University, 1981.

[48]  N.  J.  Nilsson.  Principles  of  Artificial  Intelligence.  Tioga  Publishing  Company, 
1980. 

[49] Kemal Oflazer. Partitioning in Parallel Processing of Production Systems.
In Proceedings of the International Conference on Parallel Processing, IEEE,
1984.

[50]  Kemal  Oflazer.  Partitioning  in  Parallel  Processing  of  Production  Systems. 
PhD  thesis,  Carnegie  Mellon  University,  March  1987. 

[51] C.H. Papadimitriou and K. Steiglitz. Combinatorial Optimization: Algorithms
and Complexity. Prentice-Hall, Inc., 1982.

[52] S. Papert. Mindstorms: Children, Computers, and Powerful Ideas. Basic
Books, 1980.

[53] Ian Robinson. A Prolog Processor Based on a Pattern Matching Memory
Device. In Ehud Shapiro, editor, Proceedings of the Third International
Conference on Logic Programming, pages 172-179, Springer-Verlag, July 1986.

[54] Stuart Russell. The Compleat Guide to MRS. Technical Report KSL-85-12,
Knowledge Systems Laboratory, Computer Science Department, Stanford
University, June 1985.

[55] Vivek Sarkar. Partitioning and Scheduling Parallel Programs for Execution on
Multiprocessors. PhD thesis, Electrical Engineering Department, Stanford
University, April 1987. Also available as Computer Systems Laboratory Technical
Report No. CSL-TR-87-328.




[56] C. L. Seitz. The Cosmic Cube. Communications of the ACM, 28(1):22-33,
1985.

[57]  E.  Y.  Shapiro.  A  Subset  of  Concurrent  Prolog  and  Its  Interpreter.  Technical 
Report  TR-003,  ICOT,  Japan,  January  1983. 

[58]  Ehud  Shapiro.  Systolic  programming:  a  paradigm  of  parallel  processing.  In 
Proceedings  of  the  International  Conference  on  Fifth  Generation  Computer 
Systems,  ICOT,  1984. 

[59]  N.  Singh.  Exploiting  Design  Morphology  to  Manage  Complexity.  PhD  thesis, 
Electrical  Engineering  Department,  Stanford  University,  August  1985. 

[60] N. Singh. MARS: A Multiple Abstraction Rule-Based Simulator. Technical
Report HPP-83-43, Heuristic Programming Project, Computer Science
Department, Stanford University, 1983.

[61] Vineet Singh and Michael R. Genesereth. A Variable Supply Model for
Distributing Deductions. In Proceedings of IJCAI-85, Morgan Kaufmann
Publishers Inc., August 1985.

[62] Vineet Singh and Michael R. Genesereth. PM: A Parallel Execution Model
for Backward-Chaining Deductions. Future Computing Systems, 1(3):271-308,
1986. Also available as KSL Report KSL-85-18, May 1985, Knowledge Systems
Laboratory, Computer Science Department, Stanford University.

[63] Vineet Singh and Michael R. Genesereth. PM: A Parallel Execution Model for
Backward-Chaining Deductions. KSL Report KSL-85-18, Knowledge Systems
Laboratory, Computer Science Department, Stanford University, May 1985.

[64] D. E. Smith and M. R. Genesereth. Ordering Conjunctive Queries. Artificial
Intelligence, 26(2):171-215, October 1985.

[65]  Smith,  D.  E.  Controlling  Inference.  PhD  thesis,  Department  of  Computer 
Science,  Stanford  University,  August  1985. 




[66]  Smith,  R.  G.  The  Contract  Net  Protocol:  High-Level  Communication  and 
Control  in  a  Distributed  Problem  Solver.  IEEE  Transactions  on  Computers, 
C-29(12):1104-1113,  December  1980. 

[67]  K.  S.  Stevens.  1985.  Private  communication. 

[68] K. S. Stevens, S. V. Robison, and A. L. Davis. The Post Office -
Communication Support for Distributed Ensemble Architectures. In Proceedings
of the 6th International Conference on Distributed Computing Systems,
pages 160-166, IEEE Computer Society, May 1986.

[69]  Richard  Treitel.  Sequentialization  of  Logic  Programs.  PhD  thesis,  Stanford 
University,  September  1986.  Also  available  as  Report  No.  STAN-CS-86-1135, 
Department  of  Computer  Science,  Stanford  University,  Stanford,  CA  94305. 

[70] P. C. Treleaven, D. R. Brownbridge, and R. P. Hopkins. Data-Driven and
Demand-Driven Computer Architecture. Computing Surveys, 14(1):93-143,
March 1982.

[71]  K.  Ueda.  Guarded  Horn  Clauses.  Technical  Report  TR-103,  ICOT,  Tokyo, 
1985. 

[72]  D.H.D.  Warren.  An  Abstract  Prolog  Instruction  Set.  Technical  Note  309,  SRI 
International,  AI  Center,  Computer  Science  and  Technology  Division,  1983. 

[73] D.H.D. Warren. Implementing Prolog - Compiling Predicate Logic Programs.
Research Reports 39 and 40, Department of Artificial Intelligence, University
of Edinburgh, 1977.

[74] H. Yasuura. On Parallel Computational Complexity of Unification. In
Proceedings of International Conference on Fifth Generation Computers, Tokyo,
Japan, November 1984.


